Preface





I write this book to introduce R—a language and environment for statistical com- 
puting and graphics—to epidemiologists and health data analysts conducting epi- 
demiologic studies. From my experience in public health practice, sometimes even 
formally trained epidemiologists lack the breadth of analytic skills required at health 
departments where resources are very limited. Recent graduates come prepared with 
a solid foundation in epidemiological and statistical concepts and principles and 
they are ready to run a multivariable analysis (which is not a bad thing; we are grateful
for highly trained staff). However, what is sometimes lacking is the practical
knowledge, skills, and abilities to collect and process data from multiple sources 
(e.g., Census data; reportable diseases, death and birth registries) and to adequately 
implement new methods they did not learn in school. One approach to implementing 
new methods is to look for the “commands” among their favorite statistical pack- 
ages (or to buy a new software program). If the commands do not exist, then the 
method may not be implemented. In a sense, they are looking for a custom-made 
solution that makes their work quick and easy. 

In contrast to custom-made tools or software packages, R is a suite of basic tools 
for statistical programming, analysis, and graphics. One will not find a “command” 
for a large number of analytic procedures one may want to execute. Instead, R is 
more like a set of high quality carpentry tools (hammer, saw, nails, and measuring 
tape) for tackling an infinite number of analytic problems, including those for which 
custom-made tools are not readily available or affordable. I like to think of R as a 
set of extensible tools to implement one’s analysis plan, regardless of simplicity or 
complexity. With practice, one not only learns to apply new methods, but one also 
develops a depth of understanding that sharpens one’s intuition and insight. With 
understanding comes clarity, focused problem-solving, and confidence. 

This book is divided into two parts. First, I cover how to process, manipulate, and 
operate on data in R. Most books cover this material briefly or leave it for an
appendix. I decided to dedicate a significant amount of space to this topic with the
assumption that the average epidemiologist is not familiar with R and that a good
grounding in the basics will make the later chapters more understandable. Second, I cover
basic epidemiology topics addressed in most books, but I infuse R to demonstrate
concepts and to exercise your intuition. Readers may notice a heavier emphasis on 
descriptive epidemiology which is what is more commonly used at health depart- 
ments, at least as a first step. In this section we do cover regression methods and 
graphical displays. I have also included “how to” chapters on a diversity of topics 
that come up in public health. My goal is not to be comprehensive in each topic but 
to demonstrate how R can be used to implement a diversity of methods relevant to 
public health epidemiology and evidence-based practice. 

To help us spread the word, this book is available on the World Wide Web 
(http://www.medepi.com). I do not want financial or geographical barriers
to limit access to this material. I am only presenting what I have learned from the 
generosity of others. My hope is that more and more epidemiologists will embrace 
R for epidemiological applications, or at least, include it in their toolbox. 


Berkeley, California Tomas J. Aragon 
October 14, 2013 
CHAPTER 1





Getting Started With R 





1.1 What is R? 


R is a freely available “computational language and environment for data analysis 
and graphics.” R is indispensable for anyone that uses and interprets data. As medi- 
cal, public health, and research epidemiologists, we use R in the following ways: 


• Full-function calculator
• Extensible statistical package
• High-quality graphics tool
• Multi-use programming language

We use R to explore, analyze, and understand epidemiological data. We analyze 
data straight out of tables provided in reports or articles as well as analyze usual 
data sets. The data might be a large, individual-level data set imported from another 
source (e.g., cancer registry); an imported matrix of group-level data (e.g., population
estimates or projections); or some data extracted from a journal article we are
reviewing. The ability to quantitatively express, graphically explore, and describe 
epidemiologic data and processes enables one to work and strengthen one’s epi- 
demiologic intuition. 

In fact, we only use a very small fraction of the R package. For those who develop 
an interest or have a need, R also has many of the statistical modeling tools used by 
epidemiologists and statisticians, including logistic and Poisson regression, and Cox 
proportional hazard models. However, for many of these routine statistical models, 
almost any package will suffice (SAS, Stata, SPSS, etc.). The real advantage of R 
is the ability to easily manipulate, explore, and graphically display data. Repetitive 
analytic tasks can be automated or streamlined with the creation of simple functions 
(programs that execute specific tasks). The initial learning curve is steep, but in the 
long run one is able to conduct analyses that would otherwise require a tremendous
amount of programming and time.

Some may find R challenging to learn if they are not familiar with statistical 
programming. R was created by statistical programmers and is more often used 
by analysts comfortable with matrix algebra and programming. However, even for 
those unfamiliar with matrix algebra, there are many analyses one can accomplish 
in R without using any advanced mathematics, which would be difficult in other 
programs. The ability to easily manipulate data in R will allow one to conduct good 
descriptive epidemiology, life table methods, graphical displays, and exploration of 
epidemiologic concepts. R allows one to work with data in any way they come. 


1.1.1 Who should learn R? 


Anyone that uses a calculator or spreadsheet, or analyzes numerical data at least 
weekly should seriously consider learning and using R. This includes epidemiol- 
ogists, statisticians, physician researchers, engineers, and faculty and students of 
mathematics and science courses, to name just a few. We jokingly tell our staff an- 
alysts that once they learn R they will never use a spreadsheet program again (well 
almost never!). 


1.1.2 Why should I learn R? 


To implement numerical methods we need a computational tool. On one end of the 
spectrum are calculators and spreadsheets for simple calculations, and on the other 
end of the spectrum are specialized computer programs for such things as statistical 
and mathematical modeling. However, many numerical problems are not easily han- 
dled by these approaches. Calculators, and even spreadsheets, are too inefficient and 
cumbersome for numerical calculations whose scope and scale change frequently. 
Statistical packages are usually tailored for the statistical analysis of data sets and 
often lack an intuitive, extensible, open source programming language for tackling 
new problems efficiently.¹ R can do the simplest and the most complex analysis
efficiently and effectively. 

When we learn and use R regularly we will save significant amounts of time and 
money. It’s powerful and it’s free! It’s a complete environment for data analysis and 
graphics. Its straightforward programming language facilitates the development of 
functions to extend and improve the efficiency of our analyses. 


¹ Read my recommendations for mostly free and open source software (FOSS) at medepi.com.

1.1.3 Where can I get R?


R is available for many computer platforms, including Mac OS, Linux, Microsoft 
Windows, and others. R comes as source code or a binary file. Source code needs to 
be compiled into an executable program for our computer. Those not familiar with 
compiling source code (and that’s most of us) just install the binary program. We 
assume most readers will be using R in the Mac OS or MS Windows environment. 
Listed here are useful R links: 


• R Project home page at http://www.r-project.org/;
• R download page at http://cran.r-project.org;
• Numerous free tutorials are at http://cran.r-project.org/other-docs.html;
• R Wikibook at http://en.wikibooks.org/wiki/R_Programming; and
• R Journal at http://journal.r-project.org/.


To install R for Windows or Mac OS, do the following:

1. Go to http://www.r-project.org/;
2. From the left menu list, click on the “CRAN” (Comprehensive R Archive Network) link;
3. Select a nearby geographic site (e.g., http://cran.cnr.berkeley.edu/);
4. Select the appropriate operating system;
5. Click on the “base” link;
6. For Windows, save R-X.X.X-win32.exe to the computer; and for Mac OS, save the R-X.X.X-mini.dmg disk image;
7. Run the installation program and accept the default installation options. That’s it!


1.2 How do I use R? 
1.2.1 Using R on our computer 


Use R by typing commands at the R command line prompt (>) and pressing Enter on 
our keyboard. This is how to use R interactively. Alternatively, from the R command 
line, we can also execute a list of R commands that we have saved in a text file (more 
on this later). Here is an example of using R as a calculator: 


> 8*4
[1] 32
> (4 + 6)^3
[1] 1000
Use the scan function to enter data at the command line. At the end of each line
press the Enter key to move to the next line. Pressing the Enter key twice completes 
the data entry. 


> quantity <- scan()
1: 34 56 22
4:
Read 3 items
> price <- scan()
1: 19.95 14.95 10.99
4:
Read 3 items
> total <- quantity*price
> cbind(quantity, price, total)
     quantity price  total
[1,]       34 19.95 678.30
[2,]       56 14.95 837.20
[3,]       22 10.99 241.78
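When the values are already known, the same data entry can be scripted instead of typed at the prompt. The sketch below is not from the original text; it assumes the text argument of scan, which reads from a character string rather than the keyboard, and uses quiet = TRUE only to suppress the “Read 3 items” message.

```r
# Non-interactive version of the data entry shown above (illustrative sketch)
quantity <- scan(text = "34 56 22", quiet = TRUE)
price <- scan(text = "19.95 14.95 10.99", quiet = TRUE)

total <- quantity * price          # element-by-element multiplication
cbind(quantity, price, total)      # bind the three vectors as columns
```

Because nothing is typed interactively, these lines can be saved in a text file and rerun later.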


1.2.2 Does R have epidemiology programs? 


The default installation of R does not have packages that specifically implement epi- 
demiologic applications; however, many of the statistical tools that epidemiologists 
use are readily available, including statistical models such as unconditional logis- 
tic regression, conditional logistic regression, Poisson regression, Cox proportional 
hazards regression, and much more. 

To meet the specific needs of public health epidemiologists and health data ana- 
lysts, I maintain a freely available suite of Epidemiology Tools: the epitools R 
package can be directly installed from within R. 

For example, to evaluate the association of consuming jello with a diarrheal ill- 
ness after attending a church supper in 1940, we can use the epitab function from 
the epitools package. In the R code that follows, the # symbol precedes com- 
ments that are not evaluated by R. 





> library(epitools)   #load 'epitools' package
> data(oswego)        #load Oswego dataset
> attach(oswego)      #attach dataset
> round(epitab(jello, ill, method = "riskratio")$tab, 3)
         Outcome
Predictor  N    p0  Y    p1 riskratio lower upper p.value
        N 22 0.423 30 0.577     1.000    NA    NA      NA
        Y  7 0.304 16 0.696     1.206 0.844 1.723   0.442
> round(epitab(jello, ill, method = "oddsratio")$tab, 3)
         Outcome
Predictor  N    p0  Y    p1 oddsratio lower upper p.value
        N 22 0.759 30 0.652     1.000    NA    NA      NA
        Y  7 0.241 16 0.348     1.676  0.59 4.765   0.442
> detach(oswego)      #detach dataset

Applied Epidemiology Using R 14-Oct-2013 © Tomas J. Aragon (www.medepi.com)


The risk of illness among those that consumed jello was 69.6% compared to 57.7% 
among those that did not consume jello. Both the risk ratio and the odds ratio were 
elevated but we cannot exclude random error as an explanation (p value = 0.44). 
We also notice that the odds ratio is not a good approximation to the risk ratio. This
occurs when the risks among the exposed and unexposed are greater than 10%. In
this case, the risks of diarrheal illness were 69.6% and 57.7% among exposed and
nonexposed, respectively.
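The risk ratio and odds ratio reported by epitab can be checked by hand from the four cell counts in the output (22, 30, 7, and 16). The following is a minimal base R sketch, not the epitools implementation; the object names tab, risk, and odds are chosen here for illustration.

```r
# 2x2 counts copied from the epitab output
# rows: jello (N = did not eat, Y = ate); columns: ill (N = no, Y = yes)
tab <- matrix(c(22, 7, 30, 16), nrow = 2,
              dimnames = list(jello = c("N", "Y"), ill = c("N", "Y")))

risk <- tab[, "Y"] / rowSums(tab)   # risk of illness within each exposure group
odds <- tab[, "Y"] / tab[, "N"]     # odds of illness within each exposure group

risk_ratio <- unname(risk["Y"] / risk["N"])
odds_ratio <- unname(odds["Y"] / odds["N"])

round(c(risk_ratio = risk_ratio, odds_ratio = odds_ratio), 3)
```

The rounded results agree with the 1.206 and 1.676 shown in the epitab output, and the gap between them illustrates why the odds ratio overstates the risk ratio when the outcome is common.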


1.2.3 How should I use these notes? 


The best way to learn R is to use it! Use it as our calculator! Use it as our spreadsheet!
Finally, read these notes sitting at a computer and use R interactively (this
works best sitting in a cafe that brews great coffee and plays good music). In this
book, when we display R code it appears as if we are typing the code directly at the 
R command line: 


> x <- matrix(1:9, nrow = 3, ncol = 3)
> x
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9


Sometimes the R code appears as it would in a text editor (e.g., Notepad) before it 
has been evaluated by R. When a text editor version of R code is displayed it appears 
without the command line prompt and without output: 


x <- matrix(1:9, nrow = 3, ncol = 3) 
x 


When the R code displayed exceeds the page width it will continue on the next 
line but indented. Here’s an example: 


agegrps <- c("Age < 1", "Age 1 to 4", "Age 5 to 14", "Age 15 
to 24", "Age 25 to 44", "Age 45 to 64", "Age 64+") 


Table 1.1 Common math operators

Operator  Description                          Try these examples
+         addition                             5+4
-         subtraction                          5-4
*         multiplication                       5*4
/         division                             5/4
^         exponentiation                       5^4
-         unary minus (change current sign)    -5
abs       absolute value                       abs(-23)
exp       exponentiation (e to a power)        exp(8)
log       logarithm (default is natural log)   log(exp(8))
sqrt      square root                          sqrt(64)
%/%       integer divide                       10%/%3
%%        modulus                              10%%3
%*%       matrix multiplication                xx <- matrix(1:4, 2, 2)

Although we encourage readers to begin by using R interactively, typing expressions
at the command line, as a general rule it is much better to type our code into a
text editor. We save our code with a convenient file name such as job01.R.² For
convenience, R comes with its own text editor. For Windows, under File, select New
script to open an empty text file. For Mac OS, under File, select New Document.
As before, save this text file with a .R extension. Within R’s text editor, we can
highlight and run selected commands.

² The .R extension, although not necessary, is useful when searching for R command files. Additionally, this file extension is recognized by some text editors.

The code in our text editor can be run in the following ways: 


• Highlight and run selected commands in the R editor;
• Paste the code directly into R at the command line;
• Run the file in batch mode from the R command line using source("job01.R").
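The batch-mode option can be tried without creating a file by hand. In this sketch a temporary file stands in for job01.R; the file contents and the name col_totals are made up for illustration.

```r
# Write a two-line script to a temporary file (a stand-in for job01.R)
script_file <- tempfile(fileext = ".R")
writeLines(c("x <- matrix(1:9, nrow = 3, ncol = 3)",
             "col_totals <- colSums(x)"),
           script_file)

source(script_file)   # evaluates every line of the file, in order
col_totals            # objects created by the script are now available
```

Note that source does not display intermediate results unless asked to, so the last line is what shows the column totals.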


1.3 Just do it! 


1.3.1 Using R as your calculator 


Open R and start using it as our calculator. The most common math operators are 
displayed in Table 1.1. From now on make R our default calculator! Study the exam- 
ples and spend a few minutes experimenting with R as a calculator. Use parentheses 
as needed to group operations. Use the keyboard Up-arrow to recall what we previ- 
ously entered at the command line prompt. 
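Three of the operators in Table 1.1 are less familiar than the rest, so here is a short sketch of how they behave. The numbers are arbitrary and were chosen only so the results are easy to check by hand.

```r
17 %/% 5                     # integer divide: how many whole 5s fit into 17
17 %% 5                      # modulus: what is left over
(17 %/% 5) * 5 + 17 %% 5     # the two pieces rebuild the original number

xx <- matrix(1:4, 2, 2)      # a 2 x 2 matrix, filled column by column
xx * xx                      # element-by-element multiplication
xx %*% xx                    # matrix multiplication (rows times columns)
```

Comparing the last two results shows why matrix multiplication needs its own operator.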


Table 1.2 Types of evaluable expressionsᵃ

Expression type   Try these examples
literal           "hello, I'm John Snow"   #character
                  3.5                      #numeric
                  TRUE; FALSE              #logical
math operation    6*7
assignment        x <- 4*4
data object       x
function          log(x)

ᵃ Text preceded by # on a line is not evaluated
1.3.2 Useful R concepts 


1.3.2.1 Types of evaluable expressions 


Every expression that is entered at the R command line is evaluated by R and returns
a value. A literal is the simplest expression that can be evaluated (number, character
string, or logical value). Mathematical operations involve numeric literals. For
example, R evaluates the expression 4*4 and returns the value 16. The exception
to this is when an evaluable expression is assigned an object name, as in x <- 4*4:
the value is stored but not displayed. To display the assigned value, wrap the
expression in parentheses, (x <- 4*4), or type the object name.


> 4*4
[1] 16
> x <- 4*4
> x
[1] 16
> (x <- 4*4)
[1] 16

Finally, evaluable expressions must be separated by either newline breaks or a semi- 
colon. 


> x <- 4*4; x
[1] 16 


Table 1.2 summarizes evaluable expressions. 
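For readers curious about why an assignment prints nothing, base R offers withVisible, which reports both the value of an expression and whether R would display it. This is a side illustration rather than something the rest of the chapter relies on; the names plain and wrapped are invented here.

```r
plain   <- withVisible(x <- 4 * 4)     # assignment alone
wrapped <- withVisible((x <- 4 * 4))   # same assignment, wrapped in parentheses

plain$value      # the assignment still returns the assigned value
plain$visible    # but R marks it as not to be displayed
wrapped$visible  # the parentheses make the value visible again
```

In other words, an assignment does return a value; R simply chooses not to print it.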


1.3.2.2 Using the assignment operator 

Most calculators have a memory function: the ability to assign a number or numerical
result to a key for recalling that number or result at a later time. The same is
true in R but it is much more flexible. Any evaluable expression can be assigned a
name and recalled at a later time. We refer to these variables as data objects. We use
the assignment operator (<-) to name an evaluable expression and save it as a data
object.


> xx <- "hello, what's your name"
> xx
[1] "hello, what's your name"


Wrapping the assignment expression in parentheses makes the assignment and dis- 
plays the data object value(s). 


> yy <- 5^3     #assignment; no display
> (yy <- 5^3)   #assignment; displays evaluation
[1] 125


In this book, we might use (yy <- 5^3) to display the value of yy and save
space on the page. In practice, this is more common:

> yy <- 5^3
> yy
[1] 125


Multiple assignments work and are read from right to left: 


> aa <- bb <- 99 


> aa; bb 
[1] 99 
[1] 99 


Finally, the equal sign (=) can be used for assignments, although I prefer the <-
symbol:

> ages = c(34, 45, 67) 

> ages 

[1] 34 45 67 


I prefer <- for assigning object names in the workspace because
later we use = for assigning values to function arguments. For example,


> x <- 20:25                    #object name assignment
> x
[1] 20 21 22 23 24 25
> sample(x = 1:10, size = 5)    #argument assignments
[1] 9 6 3 2 5
> x
[1] 20 21 22 23 24 25


The first x is an object name assignment in the workspace, which persists during the R
session. The second x is a function argument assignment, which is only recognized
locally in the function and only for the duration of the function execution. For clarity,
it is better to keep these types of assignments separate in our mind by using different
assignment symbols.
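The distinction has a practical consequence worth seeing once. In the sketch below (object names invented for illustration), using = inside the call leaves the workspace x alone, whereas using <- inside the call performs a real workspace assignment as a side effect.

```r
x <- 20:25
drawn_a <- sample(x = 1:10, size = 5)   # "=" names an argument; workspace x untouched
x_after_equals <- x

drawn_b <- sample(x <- 1:10, size = 5)  # "<-" assigns in the workspace, then passes the value
x_after_arrow <- x

x_after_equals   # still 20:25
x_after_arrow    # now 1:10
```

Keeping <- for workspace objects and = for arguments avoids overwriting an object by accident in this way.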
Study these previous examples and spend a few minutes using the assignment
operator to create and call data objects. Try to use descriptive names if possible. For
example, suppose we have data on age categories; we might name the data agecat,
age.cat, or age_cat.³ These are all acceptable.


1.3.3 Useful R functions 


When we start R, we open a workspace. The first time we use R, the
workspace is empty. Every time we create a data object, it is stored in the workspace.
If a data object with the same name already exists, the old data object will be overwritten
without warning, so be careful. To list the objects in the workspace use the
ls or objects functions:


> ls()          #display workspace objects
character(0)
> x <- 1:5
> ls()
[1] "x"
> x
[1] 1 2 3 4 5
> x <- 10:15    #overwrites without warning
> x
[1] 10 11 12 13 14 15

Data objects can be saved between sessions. When we quit R, we will be prompted with “Save
workspace image?” (We can also use save.image() at the command line.) The
workspace image is saved in a file called .RData.⁴ Use getwd() to display the
file path to the directory that holds the .RData file. Table 1.3 has more useful R
functions.
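Saving and restoring can be tried safely on a single object. This sketch uses save and load with a temporary file so that no existing .RData file is touched; the file location is hypothetical, and save.image() would do the same for every object in the workspace at once.

```r
rdata_file <- tempfile(fileext = ".RData")   # temporary location, for illustration

x <- 1:5
save(x, file = rdata_file)      # write the object x to disk
rm(x)                           # remove it from the workspace
exists("x", inherits = FALSE)   # check whether x is still defined here

load(rdata_file)                # restore x under its original name
x
```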


1.3.3.1 What are packages? 


R has many available functions. When we open R, several packages are attached by 
default. Each package has its own suite of functions. To display the list of attached 
packages use the search function. 


> search()   # Linux
[1] ".GlobalEnv"        "package:stats"     "package:graphics"
[4] "package:grDevices" "package:utils"     "package:datasets"
[7] "package:methods"   "Autoloads"         "package:base"

³ In older versions of R, the underscore symbol (_) could be used for assignments, but this is no
longer permitted. The “_” symbol can be used to make your object name more readable and is
consistent with other software.

⁴ In some operating systems, file names that begin with a period (.) are hidden files and are not
displayed by default. You may need to change the viewing option to see the file.

Table 1.3 Useful R functions

Function      Description                              Try these examples
q             Quit R                                   q()
ls            List objects                             ls()
objects                                                objects()  #equivalent
rm            Remove object(s)                         yy <- 1:5; ls()
remove                                                 rm(yy); ls()
                                                       #remove everything: caution!
                                                       rm(list = ls()); ls()
help          Open help instructions; or get help      help()
              on specific topic.                       help(plot)
                                                       ?plot  #equivalent
help.search   Search help system given character       help.search("print")
              string
help.start    Start help browser                       help.start()
apropos       Displays all objects in the search list  apropos("plot")
              matching topic
getwd         Get working directory                    getwd()
setwd         Set working directory                    setwd("c:/mywork/rproject")
args          Display arguments of function            args(sample)
example       Runs example of function                 example(plot)
data          Information on available R data sets;    data()  #displays data sets
              load data set                            data(Titanic)  #load data set
save.image    Saves current workspace to .RData        save.image()


To display the file paths to the packages use the searchpaths function. 


> searchpaths()   # Linux
[1] ".GlobalEnv"                   "/usr/lib/R/library/stats"
[3] "/usr/lib/R/library/graphics"  "/usr/lib/R/library/grDevices"
[5] "/usr/lib/R/library/utils"     "/usr/lib/R/library/datasets"
[7] "/usr/lib/R/library/methods"   "Autoloads"
[9] "/usr/lib/R/library/base"











To learn more about a specific package enter library(help = package_name).
Alternatively, we can get more detailed information by entering help.start(),
which opens the HTML help page. On this page click on the Packages link to see the
available packages. If we need to load a package enter library(package_name).
For example, when we cover survival analysis we will need to load the survival
package.
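Before loading a package it can help to confirm that it is installed. The sketch below uses requireNamespace for that check; the survival package ships with most R installations, but that is an assumption about the reader's setup, so the code loads it only if it is found.

```r
attached <- search()
"package:stats" %in% attached     # stats is normally attached at startup

# Check for the survival package before attaching it
has_survival <- requireNamespace("survival", quietly = TRUE)
if (has_survival) {
  library(survival)                         # attach the package
  print("package:survival" %in% search())   # it now appears in the search list
}
```

If has_survival is FALSE, the package would first need to be installed from CRAN.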
1.3.3.2 What are function arguments?


We will be using many R functions for data analysis, so we need to know some 
function basics. Suppose we are interested in taking a random sample of days from 
the month of June, which has 30 days. We want to use the sample function but
we forgot the syntax. Let’s explore: 


> sample
function (x, size, replace = FALSE, prob = NULL)
{
    if (length(x) == 1 && x >= 1) {
        if (missing(size))
            size <- x
        .Internal(sample(x, size, replace, prob))
    }
    else {
        if (missing(size))
            size <- length(x)
        x[.Internal(sample(length(x), size, replace, prob))]
    }
}
<environment: namespace:base>


Whoa! What happened? Whenever we type the function name without any parenthe- 
ses it usually returns the whole function code. This is useful when we start program- 
ming and we need to alter an existing function, borrow code for our own functions, 
or study the code for learning how to program. If we are already familiar with the 
sample function we may only need to see the syntax of the function arguments. 
For this we use the args function: 


> args(sample)
function (x, size, replace = FALSE, prob = NULL)
NULL





The terms x, size, replace, and prob are the function arguments. First, notice 
that replace and prob have default values; that is, we do not need to specify 
these arguments unless we want to override the default values. Second, notice the
order of the arguments. If we enter the argument values in the same order as the
argument list, we do not need to specify the argument names.


> dates <- 1:30
> sample(dates, 18)   #returns 18 randomly selected days (output varies by run)


Third, if we enter the arguments out of order then we will get either an error mes- 
sage or an undesired result. Arguments entered out of their default order need to be 
specified. 
> sample(18, dates)              #gives undesired results
[1] 2
> #No! We wanted sample of size = 18
> sample(size = 18, x = dates)   #gives desired result (18 random days; output varies by run)


Fourth, when we specify an argument we only need to type a sufficient number of 
letters so that R can uniquely identify it from the other arguments. 


> sample(s = 18, x = dates, r = T)   #sampling with replacement (output varies by run)


Fifth, argument values can be any valid R expression (including functions) that
yields an appropriate value. In the following example, two nested calls to sample
provide random values for the arguments of the outer sample function.

> sample(s = sample(1:100, 1), x = sample(1:10, 5), r = T)   #output varies by run


Finally, if we need more guidance on how to use the sample function enter
?sample or help(sample).
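The rules about argument order, names, and abbreviations can be verified directly. Because sample is random, this sketch resets the random number generator with set.seed before each call so the three forms can be compared; the seed value is arbitrary and the object names are invented for illustration.

```r
dates <- 1:30

set.seed(2013)                             # arbitrary seed, for repeatability only
by_position <- sample(dates, 18)           # arguments matched by position

set.seed(2013)
by_name <- sample(size = 18, x = dates)    # matched by name, so order does not matter

set.seed(2013)
by_partial <- sample(s = 18, x = dates)    # "s" is enough to identify "size"

identical(by_position, by_name)
identical(by_position, by_partial)
```

All three calls describe the same request, so with the same seed they return the same 18 days.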


1.3.4 How do I get help? 


R has extensive help capabilities. From the main menu select Help to get started
(Figure 1.1). The Frequently Asked Questions (FAQ) and R manuals
are available from this menu. The R functions (text)..., Html help, Search
help..., and Apropos... selections are also available from the command line.

From the command line, we have several options. Suppose we are interested in
learning about the help capabilities.














> ?help                 #opens help page for 'help' function
> help()                #opens help page for 'help' function
> help(help)            #opens help page for 'help' function
> help.start()          #starts HTML help page
> help.search("help")   #searches help system for "help"
> apropos("help")       #displays 'help' objects in search list
> apropos("help")
[1] "help"         "help.request" "help.search"  "help.start"


To learn about available data sets use the data function:

> data()                              #displays available data sets
> try(data(package = "survival"))     #lists survival pkg data sets
> help(pbc, package = "survival")     #displays pbc data help page


Figure 1.1 shows that the R Graphical User Interface (GUI) main
menu has a Help selection.
Fig. 1.1 Select Help from the main menu in the MS Windows R GUI


1.3.5 RStudio—An integrated development environment for R 


RStudio⁵ is a free and open source integrated development environment (IDE) for
R that runs on any desktop (Windows, Mac, or Linux) (Figure 1.2).
For our purposes, I highly recommend the installation of RStudio: it has all the tools
we need to learn and apply R.


1.3.6 Is there anything else that I need? 


Maybe.⁶ A good text editor will make our programming and data processing easier
and more efficient. A text editor is a program for, you guessed it, editing text! The
features we look for in a text editor are the following:


• Toggle between wrapped and unwrapped text
• Block cutting and pasting (also called column editing)
• Easy macro programming
• Search and replace using regular expressions
• Ability to import large datasets for editing


⁵ http://www.rstudio.org/
⁶ If your only goal is to learn R, then RStudio is more than sufficient.

Fig. 1.2 RStudio, an integrated development environment for R that runs in Linux, Mac OS, or
MS Windows. In this figure, RStudio is running in MS Windows.

Fig. 1.3 Emacs and ESS in the Mac OS

When we are programming we want our text to wrap so we can read all of our
code. When we import a data set that is wider than the screen, we do not want the
data set to wrap: we want it to appear in its tabular format. Column editing allows
us to cut and paste columns of text at will. A macro is just a way for the text editor
to learn a set of keystrokes (including search and replace) that can be executed
as needed. Searching using regular expressions means searching for text based on
relative attributes. For example, suppose we want to find all words that begin with
“b,” end with “g,” and have any number of letters in between but not “r” and “f.” Regular
expression searching makes this a trivial task. These are powerful features that, once
we use them regularly, we will wonder how we ever got along without.
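R itself understands regular expressions, so the word search described in this section can be tried without leaving the command line. This is a minimal sketch using grepl; the word list is made up, and the pattern excludes any character other than “r” and “f” between the first and last letters, which is slightly broader than “letters” but equivalent for ordinary words.

```r
words <- c("big", "bring", "bag", "beginning", "bluffing", "dog", "bog")

# ^b      word starts with "b"
# [^rf]*  any number of characters that are neither "r" nor "f"
# g$      word ends with "g"
pattern <- "^b[^rf]*g$"

matches <- grepl(pattern, words)
words[matches]
```

Here “bring” and “bluffing” are rejected because of the “r” and “f,” and “dog” because it does not start with “b.”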

If we do not want to install a text editing program then we can use the default 
text editor that comes with our computer operating system (gedit in Ubuntu Linux, 
TextEdit in Mac OS, Notepad in Windows). However, it is much better to install a 
text editor that works with R. My favorite text editor is the free and open source 
GNU Emacs. GNU Emacs can be extended with the “Emacs Speaks Statistics” 
(ESS) package. For more information on Emacs and ESS pre-installed for Windows,
visit http://ess.r-project.org/. For the Mac OS, I recommend
GNU Emacs for Mac OS⁷ (Figure 1.3) or Aquamacs.⁸

⁷ http://emacsformacosx.com/
⁸ http://aquamacs.org/
Table 1.4 Types of actions taken on R data objects and where to find examples

Action        Vector                Matrix      Array       List        Data Frame
Creating      Table 2.4             Table 2.11  Table 2.18  Table 2.24  Table 2.30
Naming        Table 2.5             Table 2.12  Table 2.19  Table 2.25  Table 2.31
Indexing      Table 2.6             Table 2.13  Table 2.20  Table 2.26  Table 2.32
Replacing     Table 2.7             Table 2.14  Table 2.21  Table 2.27  Table 2.33
Operating on  Table 2.8, Table 2.9  Table 2.15  Table 2.22  Table 2.28  Table 2.34





1.3.7 What’s ahead? 


To the novice user, R may seem complicated and difficult to learn. In fact, for its 
immense power and versatility, R is easier to learn and deploy compared to other 
statistical software (e.g. SAS, Stata, SPSS). This is because R was built from the 
ground up to be an efficient and intuitive programming language and environment. 
If one understands the logic and structure of R, then learning proceeds quickly. Just 
like a spoken language, once we know its rules of grammar, syntax, and pronuncia- 
tion, and can write legible sentences, we can figure out how to communicate almost 
anything. Before the next chapter, we want to describe the “forest”: the logic and 
structure of working with R objects and epidemiologic data. 


1.3.7.1 Working with R objects 


For our purposes, there are only five types of data objects in R⁹ and five types of
actions we take on these objects (Table 1.4). That’s it! No more, no less. You will 
learn to create, name, index (subset), replace components of, and operate on these 
data objects using a systematic, comprehensive approach. As you learn about each 
new data object type, it will reinforce and extend what you learned previously. 

A vector is a collection of elements (usually numbers): 


> x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12


A matrix is a 2-dimensional representation of a vector:


9 The sixth type of R object is a function. Functions can create, manipulate, operate on, and store
data; however, we will use functions primarily to execute a series of R “commands” and not as 
primary data objects. 
> y <- matrix(x, nrow = 2)
> y
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    3    5    7    9   11
[2,]    2    4    6    8   10   12


An array is a 3 or more dimensional representation of a vector:


> z <- array(x, dim = c(2, 3, 2))
> z
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12


A list is a collection of “bins,” each containing any kind of R object: 


> mylist <- list(x, y, z)
> mylist
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

[[2]]
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    3    5    7    9   11
[2,]    2    4    6    8   10   12

[[3]]
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
A data frame is a list in tabular form where each “bin” contains a data vector of the
same length. A data frame is the usual tabular data set familiar to epidemiologists.
Each row is a record and each column (“bin”) is a field.


> kids <- c("Tomasito", "Lusito", "Angelita")
> gender <- c("boy", "boy", "girl")
> age <- c(8, 7, 4)
> mydf <- data.frame(kids, gender, age)
> mydf
      kids gender age
1 Tomasito    boy   8
2   Lusito    boy   7
3 Angelita   girl   4


In the next chapter we explore these R data objects in greater detail. 
Problems


1.1. If you have not done so already, install R on your personal computer. What is
the R workspace file on your operating system? What is the file path to your R
workspace file? What is the name of this workspace file?


1.2. By default, which R packages come already loaded? What are the file paths to 
the default R packages? 


1.3. List all the objects in the current workspace. If there are none, create some data
objects. Using one expression, remove all the objects in the current workspace. 


1.4. One inch equals 2.54 centimeters. Correct the following R code and create a 
conversion table. 


inches <- 1:12 
centimeters <- inches/2.54 
cbind(inches, centimeters)


1.5. To convert between temperatures in degrees Celsius (C) and Fahrenheit (F), we
use the following conversion formulas:

$$C = (F - 32)\,\frac{5}{9}$$

$$F = \frac{9}{5}\,C + 32$$


At standard temperature and pressure, the freezing and boiling points of water are 
0 and 100 degrees Celsius, respectively. What are the freezing and boiling points of 
water in degrees Fahrenheit? 


1.6. For the Celsius temperatures 0, 5, 10, 15, 20, 25, ..., 80, 85, 90, 95, 100, con- 
struct a conversion table that displays the corresponding Fahrenheit temperatures. 
Hint: to create the sequence of Celsius temperatures use seq(0, 100, 5). 


1.7. BMI is a reliable indicator of total body fat, which is related to the risk of
disease and death. The score is valid for both men and women, but it does have some
limitations: it may overestimate body fat in athletes and others who have a muscular
build, and it may underestimate body fat in older persons and others who have lost
muscle mass.


Table 1.5 Body mass index classification

BMI           Classification
< 18.5        Underweight
[18.5, 25)    Normal weight
[25, 30)      Overweight
≥ 30          Obesity
Body Mass Index (BMI) is calculated from your weight in kilograms and height
in meters:

$$\mathrm{BMI} = \frac{\mathrm{kg}}{\mathrm{m}^2}$$

1 kg ≈ 2.2 lb
1 m ≈ 3.3 ft


Calculate your BMI (don’t report it to us). 


1.8. Using Table 1.1 on page 8, explain in words, and use R to illustrate, the differ- 
ence between modulus and integer divide. 


1.9. In mathematics, a logarithm (to base b) of a number x is written $\log_b(x)$ and
equals the exponent y that satisfies $x = b^y$. In other words,

$$y = \log_b(x)$$

is equivalent to

$$x = b^y$$


In R, the log function is to the base e. Implement the following R code and 
study the graph: 


curve(log(x), 0, 6)
abline(v = c(1, exp(1)), h = c(0, 1), lty = 2)


What kind of generalizations can you make about the natural logarithm and its 
base—the number e? 


1.10. Risk (R) is a probability bounded between 0 and 1. Odds is the following
transformation of R:

$$\mathrm{Odds} = \frac{R}{1 - R}$$


Use the following code to plot the odds: 
curve (x/(1-x), 0, 1) 

Now, use the following code to plot the log(odds): 
curve (log(x/(1-x)), 0, 1) 


What kind of generalizations can you make about the log(odds) as a transformation 
of risk? 


1.11. Use the data in Table 1.6 on the next page. Assume one is HIV-negative. If the
probability of infection per act is p, then the probability of not getting infected per
act is $(1 - p)$. The probability of not getting infected after 2 consecutive acts is
$(1 - p)^2$, and after 3 consecutive acts is $(1 - p)^3$. Therefore, the probability of not getting
Table 1.6 Estimated per-act risk (transmission probability) for acquisition of HIV, by exposure
route to an infected source. Source: CDC [1]

Exposure route                                  Risk per 10,000 exposures
Blood transfusion (BT)                          9,000
Needle-sharing injection-drug use (IDU)         67
Receptive anal intercourse (RAI)                50
Percutaneous needle stick (PNS)                 30
Receptive penile-vaginal intercourse (RPVI)     10
Insertive anal intercourse (IAI)                6.5
Insertive penile-vaginal intercourse (IPVI)     5
Receptive oral intercourse on penis (ROI)       1
Insertive oral intercourse with penis (IOI)     0.5





infected after n consecutive acts is $(1 - p)^n$, and the probability of getting
infected after n consecutive acts is $1 - (1 - p)^n$. For each non-blood transfusion
transmission probability (per act risk) in Table 1.6, calculate the cumulative risk of 
being infected after one year (365 days) if one carries out the same act once daily 
for one year with an HIV-infected partner. Do these cumulative risks make intuitive 
sense? Why or why not? 


1.12. The source function in R is used to “source” (read in) ASCII text files. Take 
a group of R commands that worked from a previous problem above and paste them 
into an ASCII text file and save it with the name job01.R. Then from R command 
line, source the file. Here is how it looked on my Linux computer running R: 


> source("/home/tja/Documents/courses/ph251d/jobs/job01.R")

Describe what happened. Now, set the echo option to TRUE.

> source("/home/tja/Documents/courses/ph251d/jobs/job01.R", echo = TRUE)


Describe what happened. To improve your understanding read the help file on the 
source function. 


1.13. Now run the source again (without and with echo = TRUE) but each
time create a log file using the sink function. Create two log files: job01.log1a
and job01.log1b.


> sink("/home/tja/Documents/courses/ph251d/jobs/job01.log1a")
> source("/home/tja/Documents/courses/ph251d/jobs/job01.R")
> sink() #closes connection
> sink("/home/tja/Documents/courses/ph251d/jobs/job01.log1b")
> source("/home/tja/Documents/courses/ph251d/jobs/job01.R", echo = TRUE)
> sink() #closes connection





Examine the log files and describe what happened. 


1.14. Create a new job file (job02.R) with the following code:

n <- 365
per.act.risk <- c(0.5, 1, 5, 6.5, 10, 30, 50, 67)/10000
risks <- 1-(1-per.act.risk)^n
show(risks)


Source this file at the R command line and describe what happens.
Working with R data objects





2.1 Data objects in R 


2.1.1 Atomic vs. recursive data objects 


The analysis of data in R involves creating, manipulating, and operating on data 
objects using functions. Data in R are organized as objects and have been assigned a 
name. We have already been introduced to several R data objects. We will now make 
some additional distinctions. Every data object has a mode and length. The mode of 
an object describes the type of data it contains and is available by using the mode 
function. An object can be of mode character, numeric, logical, list, or function. 


> fname <- c("Juan", "Miguel"); mode(fname)
[1] "character"
> age <- c(34, 20); mode(age)
[1] "numeric"
> lt25 <- age<25
> lt25
[1] FALSE  TRUE
> mode(lt25)
[1] "logical"
> mylist <- list(fname, age); mode(mylist)
[1] "list"
> mydat <- data.frame(fname, age); mode(mydat)
[1] "list"
> myfun <- function(x) {x^2}
> myfun(5)
[1] 25
> mode(myfun)
[1] "function"


Data objects are further categorized into atomic or recursive objects. An atomic 
data object can only contain elements from one, and only one, of the following 
modes: character, numeric, or logical. Vectors, matrices, and arrays are atomic data 
objects. A recursive data object can contain data objects of any mode. Lists, data 
frames, and functions are recursive data objects. We start by reviewing atomic data 
objects. 
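
The atomic/recursive distinction can be checked directly with the base R functions is.atomic and is.recursive. The sketch below uses small made-up objects (the names v, m, l, df, and f are illustrative only):

```r
v  <- c(1.5, 2, 3)                               # vector
m  <- matrix(1:4, 2, 2)                          # matrix
l  <- list(1, "a", TRUE)                         # list
df <- data.frame(id = 1:2, sex = c("F", "M"))    # data frame
f  <- function(x) x^2                            # function

# Vectors and matrices hold a single mode, so they are atomic
is.atomic(v)
is.atomic(m)

# Lists, data frames, and functions are recursive
is.recursive(l)
is.recursive(df)
is.recursive(f)

# An atomic vector is not recursive
is.recursive(v)
```

The first five calls return TRUE and the last returns FALSE.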

A vector is a collection of like elements without dimensions.¹ The vector
elements are all of the same mode (either character, numeric, or logical). When R
returns a vector the [n] indicates the position of the element displayed to its
immediate right.


> y <- c("Pedro", "Paulo", "Maria")
> y
[1] "Pedro" "Paulo" "Maria"
> x <- c(1, 2, 3, 4, 5)
> x
[1] 1 2 3 4 5
> x < 3
[1]  TRUE  TRUE FALSE FALSE FALSE




















A matrix is a collection of like elements organized into a 2-dimensional (tabular) 
data object. We can think of a matrix as a vector with a 2-dimensional structure. 
When R returns a matrix the [n, ] indicates the nth row and [,m] indicates the 
mth column. 


> x <- c("a", "b", "c", "d")
> y <- matrix(x, 2, 2)
> y
     [,1] [,2]
[1,] "a"  "c"
[2,] "b"  "d"


An array is a collection of like elements organized into an n-dimensional data
object. We can think of an array as a vector with an n-dimensional structure. When
R returns an array the [n, , ] indicates the nth row and [, m, ] indicates the mth
column, and so on.


> x <- 1:8
> y <- array(x, dim=c(2, 2, 2))
> y
, , 1


1 In other programming languages, vectors are either row vectors or column vectors. R does not
make this distinction until it is necessary. 
     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8


If we try to include elements of different modes in an atomic data object, R will
coerce the data object into a single mode based on the following hierarchy: character
> numeric > logical. In other words, if an atomic data object contains any character
element, all the elements are coerced to character.























> c("hello", 4.56, FALSE) 
[1] "hello" "4.56" "FALSE" 
> c(4.56, FALSE) 

[1] 4.56 0.00 
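
The coercion of logical values to numeric (TRUE becomes 1, FALSE becomes 0) is useful for counting. A minimal sketch, using a made-up logical vector named cases:

```r
# Mixing numeric and logical values yields a numeric vector
mixed <- c(4.56, TRUE, FALSE)
mode(mixed)

# Hypothetical case indicator for four subjects
cases <- c(TRUE, FALSE, TRUE, TRUE)

# sum() coerces TRUE/FALSE to 1/0, so it counts the TRUEs
n.cases <- sum(cases)

# mean() of a logical vector is the proportion of TRUEs
prop.cases <- mean(cases)

n.cases
prop.cases
```

Here n.cases is 3 and prop.cases is 0.75.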


A recursive data object can contain one or more data objects where each object 
can be of any mode. Lists, data frames, and functions are recursive data objects. Lists 
and data frames are of mode list, and functions are of mode function (Table 2.1 on 
the next page). 

A list is a collection of data objects without any restrictions: 








> x <- c(1, 2, 3)
> y <- c("Male", "Female", "Male")
> z <- matrix(1:4, 2, 2)
> mylist <- list(x, y, z)
> mylist
[[1]]
[1] 1 2 3

[[2]]
[1] "Male"   "Female" "Male"

[[3]]
     [,1] [,2]
[1,]    1    3
[2,]    2    4


A data frame is a list with a 2-dimensional (tabular) structure. Epidemiolo- 
gists are very experienced working with data frames where each row usually repre- 
sents data collected on individual subjects (also called records or observations) and 
columns represent fields for each type of data collected (also called variables). 


> subjno <- c(1, 2, 3, 4) 
Table 2.1 Summary of six types of data objects in R

Data object     Possible mode                    Default class
Atomic
  vector        character, numeric, logical      NULL
  matrix        character, numeric, logical      NULL
  array         character, numeric, logical      NULL
Recursive
  list          list                             NULL
  data frame    list                             data.frame
  function      function                         NULL

> age <- c(34, 56, 45, 23)
> sex <- c("Male", "Male", "Female", "Male")
> case <- c("Yes", "No", "No", "Yes")
> mydat <- data.frame(subjno, age, sex, case)
> mydat
  subjno age    sex case
1      1  34   Male  Yes
2      2  56   Male   No
3      3  45 Female   No
4      4  23   Male  Yes
> mode(mydat)
[1] "list"


2.1.2 Assessing the structure of data objects 


Summarized in Table 2.1 are the key attributes of atomic and recursive data objects. 
Data objects can also have class attributes. Class attributes are just a way of letting 
R know that an object is “special,” allowing R to use specific methods designed 
for that class of objects (e.g., print, plot, and summary methods). The class 
function displays the class if it exists. For our purposes, we do not need to know any 
more about classes. 

Frequently, we need to assess the structure of data objects. We already know that 
all data objects have a mode and length attribute. For example, let’s explore the infert 
data set that comes with R. The infert data comes from a matched case-control 
study evaluating the occurrence of female infertility after spontaneous and induced 
abortion. 


> data(infert) #loads data
> mode(infert)
[1] "list"
> length(infert)
[1] 8


Applied Epidemiology Using R 14-Oct-2013 © Tomas J. Aragon (www.medepi.com)


At this point we know that the data object named “infert” is a list of length 8. To
get more detailed information about the structure of infert use the str function
(str comes from “structure”).


> str(infert)
'data.frame':   248 obs. of  8 variables:
 $ education  : Factor w/ 3 levels "0-5yrs",..: 1 1 1 1 ...
 $ age        : num  26 42 39 34 35 36 23 32 21 28 ...
 $ parity     : num  6 1 6 4 3 4 1 2 1 2 ...
 $ induced    : num  1 1 2 2 1 2 0 0 0 0 ...
 $ case       : num  1 1 1 1 1 1 1 1 1 1 ...
 $ spontaneous: num  2 0 0 0 1 1 0 0 1 0 ...
 $ stratum    : int  1 2 3 4 5 6 7 8 9 10 ...


Great! This is better. We now know that infert is a data frame with 248 ob- 
servations and 8 variables. The variable names and data types are displayed along 
with their first few values. In this case, we now have sufficient information to start 
manipulating and analyzing the infert data set. 

Additionally, we can extract more detailed structural information that becomes 
useful when we want to extract data from an object for further manipulation or 
analysis (Table 2.2 on the next page). We will see extensive use of this when we 
start programming in R. 

To get practice calling data from the command line, enter data() to display the
available data sets in R. Then enter data(data_set) to load a dataset. Study the
examples in Table 2.2 on the following page and spend a few minutes exploring the
structure of the data sets we have loaded. To display detailed information about a
specific data set use ?data_set at the command prompt (e.g., ?infert).
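
The assessment functions can also be combined with other functions. As a minimal sketch (the object names infert.dim and col.classes are illustrative, not from the text), sapply applies a function to every column of a data frame, which gives a compact per-variable summary:

```r
data(infert)

# Number of rows (observations) and columns (variables)
infert.dim <- dim(infert)
infert.dim

# Class of each column, returned as a named character vector
col.classes <- sapply(infert, class)
col.classes
```

This complements str: dim confirms the 248 observations and 8 variables, and col.classes shows that education is a factor while the remaining columns hold numbers.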


2.2 A vector is a collection of like elements 


2.2.1 Understanding vectors 


A vector is a collection of like elements (i.e., the elements all have the same mode).
There are many ways to create vectors (see Table 2.4). The most common way of
creating a vector is using the concatenation function c:


> #numeric 

> chol <- c(136, 219, 176, 214, 164) 
> chol 
[1] 136 219 176 214 164 

> #character 

> fname <- c("Mateo", "Mark", "Luke", "Juan", "Jaime") 
Table 2.2 Useful functions to assess R data objects

Function     Description                              Try these examples

Returns summary objects
str          Displays summary of data object          str(infert)
             structure
attributes   Returns list with data object            attributes(infert)
             attributes

Returns specific information
mode         Returns mode of object                   mode(infert)
class        Returns class of object, if it exists    class(infert)
length       Returns length of object                 length(infert)
dim          Returns vector with object               dim(infert)
             dimensions, if applicable
nrow         Returns number of rows, if applicable    nrow(infert)
ncol         Returns number of columns, if            ncol(infert)
             applicable
dimnames     Returns list containing vectors of       dimnames(infert)
             names for each dimension, if
             applicable
rownames     Returns vector of row names of a         rownames(infert)
             matrix-like object
colnames     Returns vector of column names of a      colnames(infert)
             matrix-like object
names        Returns vector of names for the list,    names(infert)
             if applicable (for a data frame it
             returns the field names)
row.names    Returns vector of row names for a        row.names(infert)
             data frame
head         Display first 6 lines of a data frame    head(infert)
                                                      infert[1:6,] #equivalent


> fname
[1] "Mateo" "Mark"  "Luke"  "Juan"  "Jaime"
> #logical
> z <- c(T, T, F, T, F)
> z
[1]  TRUE  TRUE FALSE  TRUE FALSE


A single digit is also a vector; that is, a vector of length = 1. Let’s confirm this. 


> 5
[1] 5
> is.vector(5)
[1] TRUE
2.2.1.1 Boolean operations on vectors


In R, we use relational and logical operators (Table 2.3 on page 33) to conduct
Boolean queries. Boolean operations are a methodological workhorse of data
analysis. For example, suppose we have a vector of female movie stars and a
corresponding vector of their ages (as of January 16, 2004), and we want to select a
subset of actors based on age criteria.





> movie.stars 

[1] "Rebecca De Mornay" "Elisabeth Shue" "Amanda Peet" 
[4] "Jennifer Lopez" "Winona Ryder" "Catherine Zeta Jones" 
[7] "Reese Witherspoon" 

> ms.ages 

[1] 42 40 32 33 32 34 27
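
The commands that created these two vectors are not shown. One way to recreate them, assuming only the values printed above, is the following sketch:

```r
# Names and ages typed in from the printed output above
movie.stars <- c("Rebecca De Mornay", "Elisabeth Shue", "Amanda Peet",
                 "Jennifer Lopez", "Winona Ryder", "Catherine Zeta Jones",
                 "Reese Witherspoon")
ms.ages <- c(42, 40, 32, 33, 32, 34, 27)

# The two vectors must have the same length so that the ith age
# corresponds to the ith name
length(movie.stars) == length(ms.ages)
```

With both vectors in the workspace, the Boolean examples in this section can be run as written.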


Let’s select the actors who are in their 30s. This is done using logical vectors that
are created by using relational operators (<, >, <=, >=, ==, !=). Study the following
example:

> #logical vector for stars with ages >=30
> ms.ages >= 30
[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
> #logical vector for stars with ages <40
> ms.ages < 40
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
> #logical vector for stars with ages >=30 and <40
> (ms.ages >= 30) & (ms.ages < 40)
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
> thirtysomething <- (ms.ages >= 30) & (ms.ages < 40)
> #indexing vector based on logical vector
> movie.stars[thirtysomething]
[1] "Amanda Peet"          "Jennifer Lopez"       "Winona Ryder"
[4] "Catherine Zeta Jones"

We also saw that we can compare logical vectors using logical operators (&, |, !).
For more examples see Table 2.3. The expression movie.stars[thirtysomething]
is an example of indexing using a logical vector. Now, we can use the ! function to
select the stars that are not “thirtysomething.” Study the following:


> thirtysomething
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
> !thirtysomething
[1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
> movie.stars[!thirtysomething]
[1] "Rebecca De Mornay" "Elisabeth Shue"    "Reese Witherspoon"

To summarize: 
• Logical vectors are created using Boolean comparisons.
• Boolean comparisons are constructed using relational and logical operators.
• Logical vectors are commonly used for indexing (subsetting) data objects.


Before moving on, we need to be sure we understand the previous examples, then 
study the examples in Table 2.3 on the facing page. For practice, study the examples 
and spend a few minutes creating simple numeric vectors, then (1) generate logical 
vectors using relational operators, (2) use these logical vectors to index the original 
numerical vector or another vector, (3) generate logical vectors using the combina- 
tion of relational and logical operators, and (4) use these logical vectors to index the 
original numerical vector or another vector. 

The element-wise exclusive “or” operator (xor) returns TRUE if either compar- 
ison element is TRUE, but not if both are TRUE. In contrast, the | returns TRUE if 
either or both comparison elements are TRUE. 
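
The difference between | and xor is easiest to see when all four TRUE/FALSE combinations are lined up. A small sketch with made-up logical vectors a and b:

```r
# All four combinations of two logical values
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

# Inclusive OR: TRUE if either or both elements are TRUE
either <- a | b

# Exclusive OR: TRUE if exactly one element is TRUE
exclusive <- xor(a, b)

rbind(a, b, either, exclusive)
```

The two results differ only in the first column, where both a and b are TRUE.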

The && and || logical operators are used for control flow in if statements, where
each side is expected to be a single TRUE or FALSE. Older versions of R silently used
only the first element when longer logical vectors were provided; since R 4.3.0 this
is an error. Therefore, for element-wise comparisons of 2 or more vectors, use the
& and | operators but not the && and || operators.
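
A short sketch of the two uses side by side; the vector ages and the labels are made up for the illustration. Note that all (or any) reduces a logical vector to a single value before it reaches &&:

```r
ages <- c(8, 7, 4)

# Element-wise AND: one result per element of 'ages'
in.range <- (ages >= 5) & (ages <= 8)
in.range

# Control flow: each side of && is a single TRUE or FALSE
if (length(ages) > 0 && all(ages < 18)) {
  group <- "all children"
} else {
  group <- "mixed"
}
group
```

Here in.range is TRUE, TRUE, FALSE and group is "all children".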


2.2.2 Creating vectors 


Vectors are created directly, or indirectly as the result of manipulating an R object. 
The c function for concatenating a collection has been covered previously. Another, 
possibly more convenient, method for collecting elements into a vector is with the 
scan function. 


> x <- scan()
1: 45 67 23 89
5:
Read 4 items
> x
[1] 45 67 23 89


This method is convenient because we do not need to type c, parentheses, and com- 
mas to create the vector. The vector is created after executing a newline twice. 
To generate a sequence of consecutive integers use the : function. 


> -9:8 
[1] -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 





However, the seq function provides more flexibility in generating sequences. Here 
are some examples: 


> seq(1, 5, by = 0.5) ##specify interval
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> seq(1, 5, length = 8) ##specify length




Table 2.3 Using Boolean relational and logical operators

Operator  Description               Try these examples

Relational operators
<         Less than                 pos <- c("p1", "p2", "p3", "p4", "p5")
                                    x <- c(1, 2, 3, 4, 5)
                                    y <- c(5, 4, 3, 2, 1)
                                    x < y
                                    pos[x < y]
>         Greater than              x > y
                                    pos[x > y]
<=        Less than or equal to     x <= y
                                    pos[x <= y]
>=        Greater than or equal     x >= y
          to                        pos[x >= y]
==        Equal to                  x == y
                                    pos[x == y]
!=        Not equal to              x != y
                                    pos[x != y]

Logical operators
!         NOT                       x > 2
                                    !(x > 2)
                                    pos[!(x > 2)]
&         Element-wise AND          (x > 1) & (x < 5)
                                    pos[(x > 1) & (x < 5)]
|         Element-wise OR           (x <= 1) | (x > 4)
                                    pos[(x <= 1) | (x > 4)]
xor       Exclusive OR; similar     xx <- x <= 1
          to | for comparing        yy <- x > 4
          two vectors, but only     xor(xx, yy)
          TRUE if one or the        xx | yy
          other is true, not both

Logical operators for if function
&&        Similar to & but used     if (T && T) print("Both TRUE")
          with if function          # nothing printed
                                    if (F && T) print("Both TRUE")
||        Similar to | but used     if (T || F) print("Either TRUE")
          with if function          # nothing printed
                                    if (F || F) print("Either TRUE")






[1] 1.0000 1.5714 2.1429 2.7143 3.2857 3.8571 4.4286 5.0000
> x <- 1:8
> seq(1, 5, along = x) ##by length of other object
[1] 1.0000 1.5714 2.1429 2.7143 3.2857 3.8571 4.4286 5.0000
Table 2.4 Common ways of creating vectors

Function     Description                  Try these examples

c            Concatenate a collection     x <- c(1, 2, 3, 4, 5)
                                          y <- c(6, 7, 8, 9, 10)
                                          z <- c(x, y)
scan         Scan a collection (after     xx <- scan()
             entering data press Enter    1 2 3 4 5
             twice)
                                          yy <- scan(what = "")
                                          "Javier" "Miguel" "Martin"

                                          xx; yy
:            Generate integer             1:10
             sequence                     10:(-4)
seq          Generate sequence of         seq(1, 5, by = 0.5)
             numbers                      seq(1, 5, length = 3)
                                          zz <- c("a", "b", "c")
                                          seq(along = zz)
rep          Replicate argument           rep("Juan Nieve", 3)
                                          rep(1:3, 4)
                                          rep(1:3, 3:1)
which        Integer vector from          age <- c(8, NA, 7, 4)
             Boolean operation            which(age<5 | age>=8)
paste        Paste elements creating a    paste(c("A", "B", "C"), 1:3)
             character string             paste(c("A", "B", "C"), 1:3, sep="")
[row#, ]     Indexing a matrix returns    xx <- matrix(1:8, nrow = 2, ncol = 4)
or           a vector                     xx[2,]
[ ,col#]                                  xx[,3]
sample       Sample from a vector         sample(c("H","T"), 20, replace = TRUE)
runif        Generate random              rnorm(10, mean = 50, sd = 19)
rnorm        numbers from a               runif(n = 10, min = 0, max = 1)
rbinom       probability distribution     rbinom(n = 10, size = 20, p = 0.5)
rpois                                     rpois(n = 10, lambda = 15)
as.vector    Coerce data objects into     mx <- matrix(1:4, nrow = 2, ncol = 2)
             a vector                     mx
                                          as.vector(mx)
vector       Create vector of specified   vector("character", 5)
             mode and length              vector("numeric", 5)
                                          vector("logical", 5)
character    Create vector of             character(5)
numeric      specified type               numeric(5)
logical                                   logical(5)

These types of sequences are convenient for plotting mathematical equations.²
For example, suppose we wanted to plot the standard normal curve using the normal
equation. For a standard normal curve μ = 0 (mean) and σ = 1 (standard deviation):

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Here is the R code to plot this equation (see Figure 2.1):


mu <- 0; sigma <- 1
x <- seq(-4, 4, .01)
fx <- (1/sqrt(2*pi*sigma^2))*exp(-(x-mu)^2/(2*sigma^2))
plot(x, fx, type = "l", lwd = 2)


After assigning values to mu and sigma, we assigned to x a sequence of numbers
from −4 to 4 by intervals of 0.01. Using the normal curve equation, for every
value of x we calculated f(x), represented by the numeric vector fx. We then used
the plot function to plot x vs. f(x). The optional argument type="l" produces a
“line” and lwd=2 doubles the line width. For comparison, we also plotted a density
histogram of 500 standard normal variates that were simulated using the rnorm
function (Figure 2.1).³
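
As a check on the hand-coded equation, the result can be compared with dnorm, R's built-in normal density function. The sketch below recomputes fx so that it runs on its own; the names same and area are illustrative:

```r
mu <- 0; sigma <- 1
x <- seq(-4, 4, 0.01)

# Normal density computed from the equation
fx <- (1/sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2/(2*sigma^2))

# Compare with the built-in density; all.equal allows for tiny
# floating-point differences
same <- all.equal(fx, dnorm(x, mean = mu, sd = sigma))
same

# Rough area under the curve between -4 and 4: height times width 0.01
area <- sum(fx) * 0.01
area
```

The comparison returns TRUE, and the approximate area is very close to 1, as expected for a probability density.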

The rep function is used to replicate its arguments. Study the examples that 
follow: 


[Figure 2.1: two panels. Left: fx plotted against x, for x from -4 to 4. Right: density histogram of rnorm(500).]


Fig. 2.1 Standard normal curve from equation and simulation 


2 See also the curve function for graphing mathematical equations.
3 hist(rnorm(500), freq = FALSE, breaks = 25, main="")
> rep(5, 2) #repeat 5 2 times
[1] 5 5
> rep(1:2, 5) # repeat 1:2 5 times
 [1] 1 2 1 2 1 2 1 2 1 2
> rep(1:2, c(5, 5)) # repeat 1 5 times; repeat 2 5 times
 [1] 1 1 1 1 1 2 2 2 2 2
> rep(1:2, rep(5, 2)) # equivalent to previous
 [1] 1 1 1 1 1 2 2 2 2 2
> rep(1:5, 5:1) # repeat 1 5 times, repeat 2 4 times, etc
 [1] 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5


The paste function pastes character strings: 


> fname <- c("John", "Kenneth", "Sander") 
> lname <- c("Snow", "Rothman", "Greenland") 
> paste(fname, lname) 


[1] "John Snow" "Kenneth Rothman" "Sander Greenland" 
> paste("var", 1:7, sep="")
[1] "var1" "var2" "var3" "var4" "var5" "var6" "var7"
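
Two further variations are often handy; this is a sketch with made-up labels. The sep argument controls what is placed between the pasted pieces, paste0 is shorthand for sep = "", and collapse joins all the resulting strings into a single string:

```r
# Custom separator between the pieces
ids <- paste("subj", 1:3, sep = "-")
ids

# paste0(...) is equivalent to paste(..., sep = "")
vars <- paste0("var", 1:3)
vars

# collapse joins the elements into one character string
rhs <- paste(c("chol", "sbp", "dbp"), collapse = " + ")
rhs
```

The last result is the single string "chol + sbp + dbp", a vector of length 1.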


Indexing (subsetting) an object often results in a vector. To preserve the
dimensionality of the original object use the drop = FALSE option.


> x <- matrix(1:8, 2, 4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
> x[2,] #index 2nd row
[1] 2 4 6 8
> x[2, , drop = FALSE] #index 2nd row; keep object structure
     [,1] [,2] [,3] [,4]
[1,]    2    4    6    8
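
The effect of drop can be confirmed with dim, which returns NULL for a plain vector. A minimal sketch (row.vec and row.mat are illustrative names):

```r
x <- matrix(1:8, 2, 4)

# Default indexing drops the row dimension: the result is a vector
row.vec <- x[2, ]
dim(row.vec)

# drop = FALSE keeps the result as a 1-row matrix
row.mat <- x[2, , drop = FALSE]
dim(row.mat)
```

The first dim call returns NULL; the second returns 1 4.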


Up to now we have generated vectors of known numbers or character strings. On 
occasion we need to generate random numbers or draw a sample from a collection 
of elements. First, sampling from a vector returns a vector: 


> # toss 8 coins
> sample(c("Head","Tail"), size = 8, replace = TRUE)
[1] "Head" "Head" "Tail" "Head" "Tail" "Tail" "Head" "Head"
> # toss 2 dice
> sample(1:6, size = 2, replace = TRUE)
[1] 1 4


Second, generating random numbers from a probability distribution returns a vector: 


> # toss 8 coins twice using the binomial distribution 
> rbinom(n = 2, size = 8, p = 0.5) 
Table 2.5 Common ways of naming vectors

Function   Description             Try these examples

c          Name vector elements    x <- c(a = 1, b = 2, c = 3, d = 4); x
           at time that vector is
           created
names      Name vector elements    y <- 1:4; y
                                   names(y) <- c("a", "b", "c", "d"); y
                                   #return names, if they exist
                                   names(y)
unname     Remove names            y <- unname(y)
                                   y #equivalent: names(y) <- NULL



[1] 5 2

> # generate 6 standard normal distribution values 

> rnorm(6) 

[1] 1.52826 -0.50631 0.56446 0.10813 -1.96716 2.01802 
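
Because these values are random, running the same command again gives different numbers. When results need to be reproducible, the random number generator can be seeded first with set.seed. In this sketch the seed value 2013 is arbitrary:

```r
# Seed the generator, then draw three standard normal values
set.seed(2013)
first <- rnorm(3)

# Resetting the same seed reproduces the same draws
set.seed(2013)
second <- rnorm(3)

identical(first, second)
```

The comparison returns TRUE because both draws start from the same seed.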


There are additional ways to create vectors. To practice creating vectors study
the examples in Table 2.4 on page 34 and spend a few minutes creating simple
vectors. If we need help with a function, remember to enter ?function_name or
help(function_name).

Finally, notice that we use vectors as arguments to functions: 


# character vector used in 'sample' function
sample(c("head", "tail"), 100, replace = TRUE)
# numeric vector used in 'rep' function
rep(1:2, rep(5, 2))
# numeric vector used in 'matrix' function
matrix(c(23, 45, 16, 17), nrow = 2, ncol = 2)


2.2.3 Naming vectors 


The first way of naming vector elements is when the vector is created: 


> x <- c(chol = 234, sbp = 148, dbp = 78, age = 54) 
> x 

chol sbp dbp age 

234 148 78 54 


The second way is to create a character vector of names and then assign that vector 
to the numeric vector using the names function: 


> z <- c(234, 148, 78, 54)
> z
[1] 234 148  78  54
> names(z) <- c("chol", "sbp", "dbp", "age")
> z
chol  sbp  dbp  age
 234  148   78   54


The names function, without an assignment, returns the character vector of names,
if they exist. This character vector can be used to name elements of other vectors.


> names (z) 


[1] "chol" "sbp" "dbp" "age" 
> z2 <- c(250, 184, 90, 45) 
> z2


[1] 250 184 90 45 

> names(z2) <- names (z) 
> z2 

chol sbp dbp age 
250 184 90 45 


The unname function removes the element names from a vector: 


> unname (z2) 
[1] 250 184 90 45 


For practice study the examples in Table 2.5 on the preceding page and spend a 
few minutes creating and naming simple vectors. 


2.2.4 Indexing vectors 


Indexing a vector is subsetting or extracting elements from a vector. A vector is 
indexed by position(s), by name(s), and by logical vector. Positions are specified by 
positive or negative integers. 


> x <- c(chol = 234, sbp = 148, dbp = 78, age = 54) 
> x[c(2, 4)] #extract 2nd and 4th element 

sbp age 

148 54 

> x[-c(2, 4)] #exclude 2nd and 4th element 

chol dbp 

234 78 


Although indexing by position is concise, indexing by name (when the names exist)
is better practice in terms of documenting our code. Here is an example:


> x[c("sbp", "age")] #extract 2nd and 4th element 
sbp age 
148 54 
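
Indexing by name also protects against changes in element order. In this sketch the reordered vector is made up for illustration; position 2 no longer holds the systolic blood pressure, but the name still finds it:

```r
x <- c(chol = 234, sbp = 148, dbp = 78, age = 54)

# Same values in a different order
x.reordered <- x[c("age", "chol", "sbp", "dbp")]

# Position 2 now returns chol, not sbp
x.reordered[2]

# Indexing by name still returns sbp
x.reordered["sbp"]
```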
Table 2.6 Common ways of indexing vectors

Indexing             Try these examples

By position          x <- c(chol = 234, sbp = 148, dbp = 78, age = 54)
                     x[2] #positions to include
                     x[c(2, 3)]
                     x[-c(1, 3, 4)] #positions to exclude
                     x[-c(1, 4)]
                     x[which(x<100)]
By name (if exists)  x["sbp"]
                     x[c("sbp", "dbp")]
By logical           x < 100
                     x[x < 100]
                     (x < 150) & (x > 70)
                     bp <- (x < 150) & (x > 70)
                     x[bp]
Unique values        samp <- sample(1:5, 25, replace=T); samp
                     unique(samp)
Duplicated values    duplicated(samp) #generates logical
                     samp[duplicated(samp)]





A logical vector indexes the positions that correspond to the TRUEs. Here is an
example:


> x<=100 | x>200 

chol sbp dbp age 
TRUE FALSE TRUE TRUE 
> x[x<=100 | x>200] 
chol dbp age 

234 78 54 




















Any expression that evaluates to a valid vector of integers, names, or logicals can 
be used to index a vector. 


> (samp1 <- sample(1:4, 8, replace = TRUE))
[1] 1 3 3 3 1 3 4 1
> x[samp1]
chol  dbp  dbp  dbp chol  dbp  age chol
 234   78   78   78  234   78   54  234
> (samp2 <- sample(names(x), 8, replace = TRUE) ) 

[1] "dbp" "sbp" "sbp" "dbp" "dbp" "age" "dbp" "sbp" 
> x[samp2] 
dbp sbp sbp dbp dbp age dbp sbp 

78 148 148 78 78 54 78 148 





Notice that when we indexed by position or name we indexed the same position
repeatedly. This does not work with logical vectors: a logical index that is longer
than the vector returns NA for every TRUE beyond the last element. In the example
that follows NA means “not available.”

> (samp3 <- sample(c(TRUE, FALSE), 8, replace = TRUE))
[1] FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
> x[samp3]
 dbp <NA> <NA> <NA> <NA>
  78   NA   NA   NA   NA








































We have already seen that a vector can be indexed based on the characteristics of 
another vector. 


> kid <- c("Tomasito", "Irene", "Luisito", "Angelita", "Tomas") 
> age <- c(8, NA, 7, 4, NA) 
> age<=7 # produces logical vector























[1] FALSE NA TRUE TRUE NA 

> kid[age<=7] # index ’kid’ using ’age’ 

[1] NA "Luisito" "Angelita" NA 

> kid[!is.na(age)] # remove missing values 
[1] "Tomasito" "Luisito" "Angelita" 

> kid[age<=7 & !is.na(age) ] 

[1] "Luisito" "Angelita" 


In this example, NA represents missing data. The is.na function returns a logical 
vector with TRUEs at NA positions. To generate a logical vector to index values that 
are not missing use !is.na. 

For practice study the examples in Table 2.6 on the previous page and spend a 
few minutes creating, naming, and indexing simple vectors. 


2.2.4.1 The which function 





A Boolean operation that returns a logical vector contains TRUE values where the 
condition is true. To identify the position of each TRUE value we use the which 
function. For example, using the same data above: 





> which(age<=7) # which positions meet condition 
[1] 3 4 

> kid[which (age<=7) ] 

[1] "Luisito" "Angelita" 


Notice that it was unnecessary to remove the missing values: which returns only
the positions of TRUE values, ignoring both FALSE and NA.
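
To make the contrast concrete, here is a small sketch that reuses the kid and age vectors from the text and compares logical indexing with positional indexing through which. The object names keep_logical, keep_position, by_logical, and by_position are illustrative only.

```r
kid <- c("Tomasito", "Irene", "Luisito", "Angelita", "Tomas")
age <- c(8, NA, 7, 4, NA)

keep_logical  <- age <= 7          # FALSE NA TRUE TRUE NA
keep_position <- which(age <= 7)   # 3 4 (FALSE and NA are both dropped)

by_logical  <- kid[keep_logical]   # NA entries appear for the missing ages
by_position <- kid[keep_position]  # "Luisito" "Angelita"

# which() matches the result of explicitly excluding missing values
identical(by_position, kid[age <= 7 & !is.na(age)])  # TRUE
```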


2.2.5 Replacing vector elements (by indexing and assignment) 


To replace vector elements we combine indexing and assignment. Any elements of a 
vector that can be indexed can be replaced. Replacing vector elements is one method 
of recoding a variable. 


Applied Epidemiology Using R 14-Oct-2013 © Tomas J. Aragon (www.medepi.com) 




Table 2.7 Common ways of replacing vector elements

Replacing            Try these examples
By position          x <- c(chol = 234, sbp = 148, dbp = 78, age = 54)
                     x[1]
                     x[1] <- 250
                     x
By name (if exists)  x["sbp"]
                     x["sbp"] <- 150
                     x
By logical           x[x<100]
                     x[x<100] <- NA
                     x





> # simulate vector with 1000 age values
> age <- sample(0:100, 1000, replace = TRUE)
> mean(age)
[1] 50.378
> sd(age)
[1] 28.25947
> agecat <- age
> agecat[age<15] <- "<15"
> agecat[age>=15 & age<25] <- "15-24"
> agecat[age>=25 & age<45] <- "25-44"
> agecat[age>=45 & age<65] <- "45-64"
> agecat[age>=65] <- "65+"
> table(agecat)
agecat
  <15 15-24 25-44 45-64   65+
  125   107   207   206   355


First, we made a copy of the numeric vector age and named it agecat. Then, we 
replaced elements of agecat with character strings for each age category, creating 
a character vector. 

For practice study the examples in Table 2.7 and spend a few minutes replacing 
vector elements. 
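
As a cross-check on recoding by indexing and assignment, the same categories can be produced with the base R cut function, which this section does not cover. The sketch below is an illustration under assumptions: the seed is arbitrary, and agecat2 is a hypothetical name. Note that cut returns a factor rather than a character vector.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
age <- sample(0:100, 1000, replace = TRUE)

# Recoding by indexing and assignment, as in the text
agecat <- age
agecat[age < 15] <- "<15"
agecat[age >= 15 & age < 25] <- "15-24"
agecat[age >= 25 & age < 45] <- "25-44"
agecat[age >= 45 & age < 65] <- "45-64"
agecat[age >= 65] <- "65+"

# Same categories with cut(); right = FALSE makes intervals closed on the left,
# e.g. [15, 25) holds ages 15 through 24
agecat2 <- cut(age, breaks = c(0, 15, 25, 45, 65, Inf), right = FALSE,
               labels = c("<15", "15-24", "25-44", "45-64", "65+"))

all(agecat == as.character(agecat2))  # TRUE: both recodings agree
table(agecat2)
```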


2.2.6 Operations on vectors 


Operations on vectors are very common in epidemiology and statistics. In this section
we cover common operations on single vectors (Table 2.8 on the next page) and
multiple vectors (Table 2.9 on page 46).

Table 2.8 Selected operations on single vectors

Function   Description          Function   Description
sum        summation            range      range
cumsum     cumulative sum       rev        reverse order
diff       x[i+1]-x[i]          order      order
prod       product              sort       sort
cumprod    cumulative product   rank       rank
mean       mean                 sample     random sample
median     median               quantile   percentile
min        minimum              var        variance, covariance
max        maximum              sd         standard deviation





2.2.6.1 Operations on single vectors 


First, we focus on operating on single numeric vectors (Table 2.8). This also gives us 
the opportunity to see how common mathematical notation is translated into simple 
R code. 

To sum elements of a numeric vector x of length n ($\sum_{i=1}^{n} x_i$), use the sum
function:


> # generate and sum 100 random standard normal values 
> x <- rnorm(100) 

> sum (x) 

[1] -0.34744 


To calculate a cumulative sum of a numeric vector x of length n ($\sum_{i=1}^{k} x_i$, for
$k = 1, \ldots, n$), use the cumsum function, which returns a vector:


> # generate sequence of 2's and calculate cumulative sum
> x <- rep(2, 10)
> x
 [1] 2 2 2 2 2 2 2 2 2 2
> cumsum(x)
 [1]  2  4  6  8 10 12 14 16 18 20


To multiply elements of a numeric vector x of length n ($\prod_{i=1}^{n} x_i$), use the prod
function:


> x <- c(1, 2, 3, 4, 5, 6, 7, 8)
> prod(x)
[1] 40320


To calculate the cumulative product of a numeric vector x of length n ($\prod_{i=1}^{k} x_i$,
for $k = 1, \ldots, n$), use the cumprod function:


> x <- c(1, 2, 3, 4, 5, 6, 7, 8)
> cumprod(x)
[1]     1     2     6    24   120   720  5040 40320

To calculate the mean of a numeric vector x of length n ($\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$),
use the sum and length functions, or use the mean function:


> x <- rnorm(100) 
> sum(x) /length (x) 
[1] 0.05843341 

> mean (x) 

[1] 0.05843341 


To calculate the sample variance of a numeric vector x of length n, use the sum, 
mean, and length functions, or, more directly, use the var function. 


$$S_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$


> x <- rnorm(100)
> sum((x-mean(x))^2)/(length(x)-1)
[1] 0.9073808
> var(x) # equivalent
[1] 0.9073808





This example illustrates how we can implement a formula in R using several func- 
tions that operate on single vectors (sum, mean, and length). The var function, 
while available for convenience, is not necessary to calculate the sample variance. 

When the var function is applied to two numeric vectors, x and y, both of length
n, the sample covariance is calculated:


$$S_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$


> x <- rnorm(100); y <- rnorm(100)
> sum((x-mean(x))*(y-mean(y)))/(length(x)-1)
[1] -0.09552851
> var(x, y) # equivalent
[1] -0.09552851


The sample standard deviation, of course, is just the square root of the sample
variance (or use the sd function):


$$S_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$


> sqrt(var(x))
[1] 0.9525654
> sd(x)
[1] 0.9525654
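
The variance, covariance, and standard deviation formulas can be checked together in one short sketch. The seed is arbitrary and the names var_x, cov_xy, and sd_x are illustrative; all.equal is used instead of == because floating-point results can differ in the last digits.

```r
set.seed(2)  # arbitrary seed for reproducibility
x <- rnorm(100); y <- rnorm(100)
n <- length(x)

var_x  <- sum((x - mean(x))^2) / (n - 1)                # sample variance
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # sample covariance
sd_x   <- sqrt(var_x)                                   # sample standard deviation

all.equal(var_x, var(x))      # TRUE
all.equal(cov_xy, var(x, y))  # TRUE
all.equal(cov_xy, cov(x, y))  # TRUE: cov gives the same covariance
all.equal(sd_x, sd(x))        # TRUE
```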

To sort a numeric or character vector use the sort function.


> ages <- c(8, 4, 7)
> sort(ages)
[1] 4 7 8


However, to sort one vector based on the ordering of another vector use the order 
function. 


> ages <- c(8, 4, 7)
> subjects <- c("Tomas", "Angela", "Luis")
> subjects[order(ages)]
[1] "Angela" "Luis"   "Tomas"
> # 'order' returns positional integers for sorting
> order(ages)
[1] 2 3 1


Notice that the order function does not return the data, but rather the indexing
integers, in their new positions, for sorting the vector ages or another vector. For
example, order(ages) returned the integer vector c(2, 3, 1), which means “move the
2nd element (age = 4) to the first position, move the 3rd element (age = 7) to the
second position, and move the 1st element (age = 8) to the third position.” Verify
that sort(ages) and ages[order(ages)] are equivalent.
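
The verification suggested above can be sketched as follows, using the ages and subjects vectors from the text. The name ord is illustrative, and the decreasing = TRUE argument of order is shown as an extra that the text does not discuss.

```r
ages <- c(8, 4, 7)
subjects <- c("Tomas", "Angela", "Luis")

ord <- order(ages)                 # 2 3 1
identical(sort(ages), ages[ord])   # TRUE: the two approaches agree
subjects[ord]                      # "Angela" "Luis" "Tomas"

# Reverse ordering of one vector by another
subjects[order(ages, decreasing = TRUE)]  # "Tomas" "Luis" "Angela"
```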

To sort a vector in reverse order combine the rev and sort functions. 


> x <- c(12, 3, 14, 3, 5, 1)
> sort(x)
[1]  1  3  3  5 12 14
> rev(sort(x))
[1] 14 12  5  3  3  1


In contrast to the sort function, the rank function gives each element of a vector 
a rank score but does not sort the vector. 


> x <- c(12, 3, 14, 3, 5, 1)
> rank(x)
[1] 5.0 2.5 6.0 2.5 4.0 1.0


The median of a numeric vector is the value that puts 50% of the values below
and 50% of the values above; in other words, the 50th percentile (or 0.5 quantile).
For example, the median of c(4, 3, 1, 2, 5) is 3. For a vector of even length,
the middle values are averaged: the median of c(4, 3, 1, 2) is 2.5. To get the
median value of a numeric vector use the median or quantile function.


> ages <- c(23, 45, 67, 33, 20, 77) 
> median (ages) 
[1] 39 
> quantile(ages, 0.5) 
50% 
39 

To return the minimum value of a vector use the min function; for the maximum
value use the max function. To get both the minimum and maximum values use the
range function.


> ages <- c(23, 45, 67, 33, 20, 77)
> min(ages)
[1] 20
> sort(ages)[1] # equivalent
[1] 20
> max(ages)
[1] 77
> sort(ages)[length(ages)] # equivalent
[1] 77
> range(ages)
[1] 20 77
> c(min(ages), max(ages)) # equivalent
[1] 20 77








To sample from a vector of length n, with each element having a default sampling 
probability of 1/n, use the sample function. Sampling can be with or without 
replacement (default). If the sample size is greater than the length of the vector, then 
sampling must occur with replacement. 





> coin <- c("H", "T")
> sample(coin, size = 10, replace = TRUE)
 [1] "H" "H" "T" "T" "T" "H" "H" "H" "H" "T"
> sample(1:100, 15)
 [1]  9 24 53 11 15 63 52 73 54 84 82 66 65 20 67
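
Because sample draws at random, results change from run to run. The sketch below illustrates three points about sample: a seed (set.seed, with an arbitrary value here) makes a draw reproducible, the prob argument changes the default 1/n sampling probabilities, and asking for more elements than the vector holds without replacement is an error. The names toss1, toss2, biased, and res are illustrative.

```r
coin <- c("H", "T")

# Same seed, same draw
set.seed(123)  # arbitrary seed
toss1 <- sample(coin, size = 10, replace = TRUE)
set.seed(123)
toss2 <- sample(coin, size = 10, replace = TRUE)
identical(toss1, toss2)  # TRUE

# Unequal sampling probabilities: heads favored 80% to 20%
biased <- sample(coin, size = 1000, replace = TRUE, prob = c(0.8, 0.2))
table(biased)

# Sample size larger than the vector requires replace = TRUE
res <- try(sample(coin, size = 10), silent = TRUE)
inherits(res, "try-error")  # TRUE: sampling without replacement failed
```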


2.2.6.2 Operations on multiple vectors 


Next, we review selected functions that work with one or more vectors. Some of 
these functions manipulate vectors and others facilitate numerical operations. 
In addition to creating vectors, the c function can be used to append vectors. 


> x <- 6:10
> y <- 20:24
> c(x, y)
 [1]  6  7  8  9 10 20 21 22 23 24


The append function also appends vectors; however, one can specify at which 
position. 


> append(x, y)
 [1]  6  7  8  9 10 20 21 22 23 24
> append(x, y, after = 2)
 [1]  6  7 20 21 22 23 24  8  9 10






Table 2.9 Selected operations on multiple vectors

Function        Description
c               Concatenates vectors
append          Appends a vector to another vector (default is to append at the end
                of the first vector)
cbind           Column-bind vectors or matrices
rbind           Row-bind vectors or matrices
table           Creates contingency table from 2 or more vectors
xtabs           Creates contingency table from 2 or more factors in a data frame
ftable          Creates flat contingency table from 2 or more vectors
outer           Outer product
tapply          Applies a function to strata of a vector
<, >, <=, >=,   Relational operators, see Table 2.3 on page 33
==, !=
!, &, |         Logical operators, see Table 2.3 on page 33





In contrast, the cbind and rbind functions concatenate vectors into a matrix. 
During the outbreak of severe acute respiratory syndrome (SARS) in 2003, a patient 
with SARS potentially exposed 111 passengers on board an airline flight. Of the 
23 passengers that sat “close” to the index case, 8 developed SARS; among the 88 
passengers that did not sit “close” to the index case, only 10 developed SARS [2]. 
Now, we can bind 2 vectors to create a 2 x 2 table (matrix). 


> case <- c("exposed" = 8, "unexposed" = 10)
> noncase <- c("exposed" = 15, "unexposed" = 78)
> cbind(case, noncase)
          case noncase
exposed      8      15
unexposed   10      78
> rbind(case, noncase)
        exposed unexposed
case          8        10
noncase      15        78


For the example that follows, let’s recreate the SARS data as two character vec- 
tors. 


> outcome <- c(rep("case", 8+10), rep("noncase", 15+78))
> tmp <- c("exposed", "unexposed")
> exposure <- c(rep(tmp, c(8, 10)), rep(tmp, c(15, 78)))
> cbind(exposure, outcome)[1:4,] # display 4 rows
     exposure  outcome
[1,] "exposed" "case"
[2,] "exposed" "case"
[3,] "exposed" "case"
[4,] "exposed" "case"


Now, use the table function to cross-tabulate one or more vectors. 


> table(outcome, exposure) 


exposure 

outcome exposed unexposed 
case 8 10 
noncase 15 78 


The ftable function creates a flat contingency table from one or more vectors.


> ftable(outcome, exposure)
        exposure exposed unexposed
outcome
case                   8        10
noncase               15        78


This will come in handy later when we want to display a 3 or more dimensional 
table as a “flat” 2-dimensional table. 

The outer function applies a function to every combination of elements from
two vectors. For example, create a multiplication table for the numbers 1 to 5.


> outer(1:5, 1:5, "*")
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    4    6    8   10
[3,]    3    6    9   12   15
[4,]    4    8   12   16   20
[5,]    5   10   15   20   25


The tapply function applies a function to strata of a vector that is defined by 
one or more “indexing” vectors. For example, to calculate the mean age of females 
and males: 


> age <- c(23, 45, 67, 88, 22, 34, 80, 55, 21, 48)
> sex <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
> tapply(X = age, INDEX = sex, FUN = mean)
   F    M
54.0 42.6
> # equivalent
> tapply(age, sex, sum)/tapply(age, sex, length)
   F    M
54.0 42.6





The tapply function is important and versatile because it allows any function that
can be applied to a vector to be applied to strata of that vector. Moreover, we can
use our own user-created functions as well.
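
As an illustration of passing a user-created function to tapply, consider the sketch below. It assumes the alternating male/female coding of sex, which reproduces the stratum means of 54.0 and 42.6 reported in the text; the helper spread (largest value minus smallest) is hypothetical and defined only for this example.

```r
age <- c(23, 45, 67, 88, 22, 34, 80, 55, 21, 48)
# Assumed coding: alternating "M", "F" (consistent with the means in the text)
sex <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")

# Hypothetical user-created function: width of the range within a stratum
spread <- function(v) max(v) - min(v)

tapply(age, sex, mean)          # F 54.0, M 42.6
res <- tapply(age, sex, spread)
res                             # F 54, M 59
```

For females the ages run from 34 to 88 (spread 54); for males they run from 21 to 80 (spread 59).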

Table 2.10 Deaths among subjects who received tolbutamide and placebo in the University Group
Diabetes Program (1970)

            Tolbutamide   Placebo
Deaths               30        21
Survivors           174       184


2.3 A matrix is a 2-dimensional table of like elements 


2.3.1 Understanding matrices 


A matrix is a 2-dimensional table of like elements. Matrix elements can be
numeric, character, or logical. Contingency tables in epidemiology are represented
in R as numeric matrices or arrays. An array is the generalization of matrices to 3 or
more dimensions (commonly known as stratified tables). We cover arrays later; for
now we will focus on 2-dimensional tables.

Consider the 2 x 2 table of crude data in Table 2.10 [3]. In this randomized clin- 
ical trial (RCT), diabetic subjects were randomly assigned to receive either tolbu- 
tamide, an oral hypoglycemic drug, or placebo. Because this was a prospective study 
we can calculate risks, odds, a risk ratio, and an odds ratio. We will do this using R 
as a calculator. 





> dat <- matrix(c(30, 174, 21, 184), 2, 2) 
> rownames(dat) <- c("Deaths", "Survivors") 
> colnames(dat) <- c("Tolbutamide", "Placebo") 
> coltot <- apply(dat, 2, sum) #column totals 
> risks <- dat["Deaths", ]/coltot 
> risk.ratio <- risks/risks[2] #risk ratio 
> odds <- risks/(1-risks) 
> odds.ratio <- odds/odds[2] #odds ratio 
> # display results 
> dat 

Tolbutamide Placebo 
Deaths 30 21 
Survivors 174 184 


> rbind(risks, risk.ratio, odds, odds.ratio) 
Tolbutamide Placebo 


risks 0.1470588 0.1024390 
risk.ratio 1.4355742 1.0000000 
odds 0.1724138 0.1141304 
odds.ratio 1.5106732 1.0000000 


Now let’s review each line briefly to understand the analysis in more detail. 


dat <- matrix(c(30, 174, 21, 184), 2, 2) 

We used the matrix function to take a vector and convert it into a matrix with 2
rows and 2 columns. Notice the matrix function reads in the vector column-wise.
To read the vector in row-wise we would add the byrow=TRUE option. Try creating
a matrix reading in a vector column-wise (default) and row-wise.
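
The difference between column-wise and row-wise filling can be seen in a short sketch with the same four counts. The names v, by_col, and by_row are illustrative.

```r
v <- c(30, 174, 21, 184)

by_col <- matrix(v, nrow = 2, ncol = 2)                # default: fill columns first
by_row <- matrix(v, nrow = 2, ncol = 2, byrow = TRUE)  # fill rows first

by_col
by_row

# For a square matrix, filling by row is the transpose of filling by column
identical(by_row, t(by_col))  # TRUE
```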





rownames (dat) <- c("Deaths", "Survivors") 
colnames (dat) <- c("Tolbutamide", "Placebo") 


We used the rownames and the colnames functions to assign row and column 
names to the matrix dat. The row names and the column names are both character 
vectors. 


coltot <- apply(dat, 2, sum) #column totals 


We used the apply function to sum the columns; it is a versatile function for apply- 
ing any function to matrices or arrays. The second argument is the MARGIN option: 
in this case, MARGIN=2, meaning apply the sum function to the columns. To sum 
the rows, set MARGIN=1. 


risks <- dat["Deaths",]/coltot 
risk.ratio <- risks/risks[2] #risk ratio 


We calculated the risks of death for each treatment group. We got the numerator 
by indexing the dat matrix using the row name "Deaths". The numerator is a 
vector containing the deaths for each group and the denominator is the total number 
of subjects in each group. We calculated the risk ratios using the placebo group as 
the reference. 


odds <- risks/(1-risks)
odds.ratio <- odds/odds[2] #odds ratio


Using the definition of the odds, we calculated the odds of death for each treatment 
group. Then we calculated the odds ratios using the placebo group as the reference. 


dat 
rbind(risks, risk.ratio, odds, odds.ratio) 


Finally, we display the dat table we created. We also created a table of results by 
row binding the vectors using the rbind function. 

In the sections that follow we will cover the necessary concepts to make the 
previous analysis routine. 
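
One way to make the analysis above routine is to wrap the steps in a function. The following is only a sketch under stated assumptions: risk_table is a hypothetical name, it expects a 2 x 2 matrix with the outcome of interest in row 1 and the treatment groups in the columns, and ref gives the column used as the reference group. It uses colSums in place of apply(dat, 2, sum).

```r
# Hypothetical helper: risks, odds, and ratios from a 2 x 2 matrix.
# Assumes events are in row 1 and exposure groups are in the columns.
risk_table <- function(tab, ref = 2) {
  stopifnot(is.matrix(tab), all(dim(tab) == c(2, 2)))
  coltot <- colSums(tab)        # column totals
  risks <- tab[1, ] / coltot    # risk of the event in each group
  odds <- risks / (1 - risks)   # odds of the event in each group
  rbind(risks = risks,
        risk.ratio = risks / risks[ref],
        odds = odds,
        odds.ratio = odds / odds[ref])
}

dat <- matrix(c(30, 174, 21, 184), 2, 2,
              dimnames = list(c("Deaths", "Survivors"),
                              c("Tolbutamide", "Placebo")))
res <- risk_table(dat)
round(res, 4)
```

With the UGDP data this reproduces the risk ratio of about 1.44 and the odds ratio of about 1.51 shown earlier.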


2.3.2 Creating matrices 


There are several ways to create matrices (Table 2.11). In general,
we create or use matrices in the following ways:

• Contingency tables (cross tabulations)
• Spreadsheet calculations and display
• Collecting results into tabular form
• Results of 2-variable equations


Table 2.11 Common ways of creating a matrix

Function       Description                 Try these examples
cbind          Column-bind vectors or      x <- 1:3
               matrices                    y <- 3:1
                                           z <- cbind(x, y); z
rbind          Row-bind vectors or         z2 <- rbind(x, y); z2
               matrices
matrix         Generates matrix            mtx <- matrix(1:4, nrow=2, ncol=2); mtx
dim            Assign dimensions to a      mtx2 <- 1:4; mtx2
               data object                 dim(mtx2) <- c(2, 2); mtx2
array          Generates matrix when       mtx <- array(1:4, dim = c(2, 2)); mtx
               array is 2-dimensional
table          Creates contingency         table(infert$educ, infert$case)
               table
xtabs          Create a contingency        xtabs(~education + case, data = infert)
               table using a formula
               interface
ftable         Creates flat contingency    ftable(infert$educ, infert$spont,
               table                         infert$case)
as.matrix      Coerces object into a       1:3
               matrix                      as.matrix(1:3)
outer          Outer product of two        outer(1:5, 1:5, "*")
               vectors
x[row, , ]     Indexing an array can       x <- array(1:8, c(2, 2, 2))
x[ , col, ]    return a matrix             x[1, , ]
x[ , , dep]                                x[ , 1, ]
                                           x[ , , 1]


2.3.2.1 Contingency tables (cross tabulations) 


In the previous section we used the matrix function to create the 2 x 2 table for
the UGDP clinical trial:


> dat <- matrix(c(30, 174, 21, 184), 2, 2)
> rownames(dat) <- c("Deaths", "Survivors")
> colnames(dat) <- c("Tolbutamide", "Placebo")
> dat
          Tolbutamide Placebo
Deaths             30      21
Survivors         174     184

Alternatively, we can create a 2-way contingency table using the table function
with fields from a data set:


> dat2 <- read.table("http://www.medepi.net/data/ugdp.txt",
+     header = TRUE, sep = ",")
> names(dat2) #display field names
[1] "Status"    "Treatment" "Agegrp"
> table(dat2$Status, dat2$Treatment)

            Placebo Tolbutamide
  Deaths         21          30
  Survivors     184         174


Alternatively, the xtabs function cross tabulates using a formula interface. An 
advantage of this function is that the field names are included. 


> xtabs(~Status + Treatment, data = dat2) 


Treatment 
Status Placebo Tolbutamide 
Deaths 21 30 
Survivors 184 174 


Finally, a multi-dimensional contingency table can be presented as a 2-dimensional
flat contingency table using the ftable function. Here we stratify the above table
by the variable Agegrp.


> xtab3way <- xtabs(~Status + Treatment + Agegrp, data=dat2) 
> xtab3way 
, , Agegrp = <55 


Treatment 
Status Placebo Tolbutamide 
Deaths 5 8 
Survivors 115 98 


, , Agegrp = 55+ 


Treatment 
Status Placebo Tolbutamide 
Deaths 16 22 
Survivors 69 76 


> ftable(xtab3way) #convert to flat table
                      Agegrp <55 55+
Status    Treatment
Deaths    Placebo              5  16
          Tolbutamide          8  22
Survivors Placebo            115  69
          Tolbutamide         98  76
> #alternative and more consistent with xtab3way
> ftable(xtabs(~Agegrp + Status + Treatment, data=dat2))
                 Treatment Placebo Tolbutamide
Agegrp Status
<55    Deaths                    5           8
       Survivors               115          98
55+    Deaths                   16          22
       Survivors                69          76


2.3.2.2 Spreadsheet calculations and display 


Matrices are commonly used to display spreadsheet-like calculations. In fact, a very 
efficient way to learn R is to use it as our spreadsheet. For example, assuming the 
rate of seasonal influenza infection is 10 infections per 100 person-years, let’s cal- 
culate the individual cumulative risk of influenza infection at the end of 1, 5, and 10 
years. Assuming no competing risk, we can use the exponential formula: 


$$R(0,t) = 1 - e^{-\lambda t}$$


where $\lambda$ = infection rate and $t$ = time.


> lamb <- 10/100
> years <- c(1, 5, 10)
> risk <- 1 - exp(-lamb*years)
> cbind(rate = lamb, years, cumulative.risk = risk)
     rate years cumulative.risk
[1,]  0.1     1      0.09516258
[2,]  0.1     5      0.39346934
[3,]  0.1    10      0.63212056


Therefore, the cumulative risk of influenza infection after 1, 5, and 10 years is 9.5%, 
39%, and 63%, respectively. 
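
The same spreadsheet-style calculation extends to several infection rates at once. The sketch below is an illustration, not part of the original analysis: the rates are hypothetical values chosen for the example, and the outer function (introduced in Section 2.2.6.2) evaluates the exponential formula for every combination of rate and time.

```r
rates <- c(0.05, 0.10, 0.20)  # hypothetical rates per person-year
years <- c(1, 5, 10)

# Cumulative risk 1 - exp(-lambda * t) for each rate (rows) and time (columns)
risk <- outer(rates, years, function(lambda, t) 1 - exp(-lambda * t))
dimnames(risk) <- list(rate = as.character(rates), years = as.character(years))
round(risk, 3)
```

The row for a rate of 0.1 matches the cumulative risks calculated above.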


2.3.2.3 Collecting results into tabular form 


A 2-way contingency table from the table or xtabs functions does not have 
margin totals. However, we can construct a numeric matrix that includes the totals. 
Using the UGDP data again, 


> dat2 <- read.table("http://www.medepi.net/data/ugdp.txt",
+     header = TRUE, sep=",")
> tab2 <- xtabs(~Status + Treatment, data = dat2)
> rowt <- tab2[,1] + tab2[,2]
> tab2a <- cbind(tab2, Total = rowt)
> colt <- tab2a[1,] + tab2a[2,]
> tab2b <- rbind(tab2a, Total = colt)
> tab2b
          Placebo Tolbutamide Total
Deaths         21          30    51
Survivors     184         174   358
Total         205         204   409


This table (tab2b) is primarily for display purposes.


2.3.2.4 Results of 2-variable equations 


When we have an equation with 2 variables, we can use a matrix to display the 
answers for every combination of values contained in both variables. For example, 
consider this equation: 


$$z = xy$$


And suppose x = {1,2,3,4,5} and y = {6,7,8,9, 10}. Here’s the long way to create 
a matrix for this equation: 


> x <- 1:5; y <- 6:10
> z <- matrix(NA, 5, 5) #create empty matrix of missing values
> for(i in 1:5){
+   for(j in 1:5){
+     z[i, j] <- x[i]*y[j]
+   }
+ }
> rownames(z) <- x; colnames(z) <- y
> z
   6  7  8  9 10
1  6  7  8  9 10
2 12 14 16 18 20
3 18 21 24 27 30
4 24 28 32 36 40
5 30 35 40 45 50


Okay, but the outer function is much better for this task: 


> x <- 1:5; y <- 6:10
> z <- outer(x, y, "*")
> rownames(z) <- x; colnames(z) <- y
> z
   6  7  8  9 10
1  6  7  8  9 10
2 12 14 16 18 20
3 18 21 24 27 30
4 24 28 32 36 40
5 30 35 40 45 50


In fact, the outer function can be used to calculate the “surface” for any 2-variable
equation (more on this later).


Table 2.12 Common ways of naming matrix components

Function    Try these examples
matrix      #name rows and columns only
            dat <- matrix(c(178, 79, 1411, 1486), 2, 2,
                dimnames = list(c("Type A", "Type B"),
                    c("Yes", "No")))
            dat
            #name rows, columns, and fields
            dat <- matrix(c(178, 79, 1411, 1486), 2, 2,
                dimnames = list(Behavior = c("Type A", "Type B"),
                    "Heart attack" = c("Yes", "No")))
            dat
rownames    #name rows only
            dat <- matrix(c(178, 79, 1411, 1486), 2, 2)
            rownames(dat) <- c("Type A", "Type B")
            dat
colnames    #name columns only
            dat <- matrix(c(178, 79, 1411, 1486), 2, 2)
            colnames(dat) <- c("Yes", "No"); dat
dimnames    #name rows and columns only
            dat <- matrix(c(178, 79, 1411, 1486), 2, 2)
            dimnames(dat) <- list(c("Type A", "Type B"),
                c("Yes", "No"))
            dat
            #name rows, columns, and fields
            dat <- matrix(c(178, 79, 1411, 1486), 2, 2)
            dimnames(dat) <- list(Behavior = c("Type A", "Type B"),
                "Heart attack" = c("Yes", "No"))
            dat
names       #name fields when row and column names already exist
            dat <- matrix(c(178, 79, 1411, 1486), 2, 2,
                dimnames = list(c("Type A", "Type B"),
                    c("Yes", "No")))
            dat #display w/o field names
            names(dimnames(dat)) <- c("Behavior", "Heart attack")
            dat #display w/ field names

2.3.3 Naming a matrix


We have already seen several examples of naming components of a matrix. Ta- 
ble 2.12 on the facing page summarizes the common ways of naming matrix compo- 
nents. The components of a matrix can be named at the time the matrix is created, or 
they can be named later. For a matrix, we can provide the row names, column names, 
and/or field names. For example, consider the UGDP clinical trial 2 x 2 table: 


> tab 
Treatment 
Outcome Tolbutamide Placebo 
Deaths 30 21 
Survivors 174 184 


In the “Treatment” field, the possible values, “Tolbutamide” and “Placebo,” are the 
column names. Similarly, in the “Outcome” field, the possible values, “Deaths” and 
“Survivors,” are the row names. 

To review, the components of a matrix can be named at the time the matrix is
created:


> tab <- matrix(c(30, 174, 21, 184), 2, 2, 


+ dimnames = list (Outcome = c("Deaths", "Survivors"), 
+ Treatment = c("Tolbutamide", "Placebo"))) 
> tab 
Treatment 
Outcome Tolbutamide Placebo 
Deaths 30 21 
Survivors 174 184 


If a matrix does not have field names, we can add them after the fact, but we must 
use the names and dimnames functions together. Having field names is necessary 
if the row and column names are not self-explanatory, as this example illustrates. 


> y <- matrix(c(30, 174, 21, 184), 2, 2) 
> rownames(y) <- c("Yes", "No") 
> colnames(y) <- c("Yes", "No") 
> y #labels not informative 
Yes No 
Yes 30 21 
No 174 184 
> #add field names 
> names (dimnames(y)) <- c("Death", "Tolbutamide") 
> y
Tolbutamide 
Death Yes No 
Yes 30 21 
No 174 184 


Study and test the examples in Table 2.12 on the preceding page. 

Table 2.13 Common ways of indexing a matrix

Indexing             Try these examples
By position          dat <- matrix(c(178, 79, 1411, 1486), 2, 2)
                     dimnames(dat) <- list(Behavior = c("Type A","Type B"),
                         "Heart attack" = c("Yes","No"))
                     dat[1, ]
                     dat[1,2]
                     dat[2, , drop = FALSE]
By name (if exists)  dat["Type A", ]
                     dat["Type A", "No"]
                     dat["Type B", , drop = FALSE]
By logical           dat[, 1] > 100
                     dat[dat[, 1] > 100, ]
                     dat[dat[, 1] > 100, , drop = FALSE]





2.3.4 Indexing a matrix 


Similar to vectors, a matrix can be indexed by position, by name, or by logical. Study 
and practice the examples in Table 2.13. An important skill to master is indexing 
rows of a matrix using logical vectors. Consider the following matrix of data, and 
suppose I want to select the rows for subjects age less than 60 and systolic blood 
pressure less than 140. 


> dat
     age chol sbp
[1,]  45  145 124
[2,]  56  168 144
[3,]  73  240 150
[4,]  44  144 134
[5,]  65  210 112
> dat[,"age"]<60
[1]  TRUE  TRUE FALSE  TRUE FALSE
> dat[,"sbp"]<140
[1]  TRUE FALSE FALSE  TRUE  TRUE
> tmp <- dat[,"age"]<60 & dat[,"sbp"]<140
> tmp
[1]  TRUE FALSE FALSE  TRUE FALSE
> dat[tmp,]
     age chol sbp
[1,]  45  145 124
[2,]  44  144 134









































Notice that the tmp logical vector is the intersection of the logical vectors separated
by the logical operator &.
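
The row selection above can be reproduced with a self-contained sketch. The matrix is rebuilt here with cbind from the values shown in the text (how dat was originally created is not shown, so this construction is an assumption), and the second part illustrates why drop = FALSE matters when only one row qualifies. The name one is illustrative.

```r
# Rebuild the example matrix (construction assumed; values from the text)
dat <- cbind(age  = c(45, 56, 73, 44, 65),
             chol = c(145, 168, 240, 144, 210),
             sbp  = c(124, 144, 150, 134, 112))

# Rows for subjects aged under 60 with systolic blood pressure under 140
tmp <- dat[, "age"] < 60 & dat[, "sbp"] < 140
dat[tmp, ]   # rows 1 and 4

# When a single row qualifies, the result collapses to a vector
# unless drop = FALSE is used
one <- dat[, "age"] > 70
is.matrix(dat[one, ])                # FALSE
is.matrix(dat[one, , drop = FALSE])  # TRUE
```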

Table 2.14 Common ways of replacing matrix elements

Replacing            Try these examples
By position          dat <- matrix(c(178, 79, 1411, 1486), 2, 2)
                     dimnames(dat) <- list(c("Type A","Type B"),
                         c("Yes","No"))
                     dat[1, ] <- 99
                     dat
By name (if exists)  dat["Type A", ] <- c(178, 1411)
                     dat
By logical           qq <- dat[ ,1]<100
                     qq
                     dat[qq, ] <- 99
                     dat
                     dat[dat[ ,1]<100, ] <- c(79, 1486)
                     dat








2.3.5 Replacing matrix elements 


Remember, replacing matrix elements is just indexing plus assignment: anything 
that can be indexed can be replaced. Study and practice the examples in Table 2.14. 


2.3.6 Operations on a matrix 


In epidemiology books, authors have preferences for displaying contingency tables. 
Software packages have default displays for contingency tables. In practice, we may 
need to manipulate a contingency table to facilitate further analysis. Consider the 
following 2-way table: 


> tab
           Treatment
Outcome     Tolbutamide Placebo
  Deaths             30      21
  Survivors         174     184


We can transpose the matrix using the t function. 


> t (tab) 
Outcome 
Treatment Deaths Survivors 
Tolbutamide 30 174 
Placebo 21 184 


We can reverse the order of the rows and/or columns. 


> tab[2:1,] #reverse rows
           Treatment
Outcome     Tolbutamide Placebo
  Survivors         174     184
  Deaths             30      21
> tab[,2:1] #reverse columns
           Treatment
Outcome     Placebo Tolbutamide
  Deaths         21          30
  Survivors     184         174
> tab[2:1,2:1] #reverse rows and columns
           Treatment
Outcome     Placebo Tolbutamide
  Survivors     184         174
  Deaths         21          30


Table 2.15 Common ways of operating on a matrix

Function      Description                  Try these examples
t             Transpose matrix             dat #from Table 2.14 on the preceding page
                                           t(dat)
apply         Apply a function to the      apply(X = dat, MARGIN = 2, FUN = sum)
              margins of a matrix          apply(dat, 1, FUN=sum)
                                           apply(dat, 1, mean)
                                           apply(dat, 2, cumprod)
sweep         Return an array obtained     rsum <- apply(dat, 1, sum)
              from an input array by       rdist <- sweep(dat, 1, rsum, "/")
              sweeping out a summary       rdist
              statistic                    csum <- apply(dat, 2, sum)
                                           cdist <- sweep(dat, 2, csum, "/")
                                           cdist
The following short-cuts use apply and/or sweep functions.
margin.table  For a contingency table in   margin.table(dat)
rowSums       array form, compute the      margin.table(dat, 1)
colSums       sum of table entries for a   rowSums(dat) #equivalent
              given index. These           apply(dat, 1, sum) #equivalent
              functions are really just    margin.table(dat, 2)
              the apply function using     colSums(dat) #equivalent to previous
              sum.                         apply(dat, 2, sum)
addmargins    Calculate and display        addmargins(dat)
              marginal totals of a
              matrix
rowMeans      For a contingency table in   rowMeans(dat)
colMeans      array form, compute the      apply(dat, 1, mean) #equivalent
              mean of table entries for    colMeans(dat)
              a given index. These         apply(dat, 2, mean) #equivalent
              functions are really just
              the apply function using
              mean.
prop.table    Short cut that uses the      prop.table(dat)
              sweep and apply functions    dat/sum(dat)
              to get margin and joint      prop.table(dat, 1)
              distributions                sweep(dat, 1, apply(dat, 1, sum), "/")
                                           prop.table(dat, 2)
                                           sweep(dat, 2, apply(dat, 2, sum), "/")





2.3.6.1 The apply function 


The apply function is an important and versatile function for conducting op- 
erations on rows or columns of a matrix, including user-created functions. The 
same functions that are used to conduct operations on single vectors (Table 2.8 on 
page 42) can be applied to rows or columns of a matrix. 

To calculate the row or column totals use the apply with the sum function: 


> tab
           Treatment
Outcome     Tolbutamide Placebo
  Deaths             30      21
  Survivors         174     184
> apply(tab, 1, sum) #row totals
   Deaths Survivors
       51       358
> apply(tab, 2, sum) #column totals
Tolbutamide     Placebo
        204         205


These operations can be used to calculate marginal totals and have them combined 
with the original table into one table. 


> tab
           Treatment
Outcome     Tolbutamide Placebo
  Deaths             30      21
  Survivors         174     184
> rtot <- apply(tab, 1, sum) #row totals
> tab2 <- cbind(tab, Total = rtot)
> tab2
          Tolbutamide Placebo Total
Deaths             30      21    51
Survivors         174     184   358
> ctot <- apply(tab2, 2, sum) #column totals
> rbind(tab2, Total = ctot)
          Tolbutamide Placebo Total
Deaths             30      21    51
Survivors         174     184   358
Total             204     205   409


For convenience, R provides some functions for calculating marginal totals and
row or column means (margin.table, rowSums, colSums, rowMeans, and colMeans).
These functions return the same results as the apply function used with sum or
mean (see footnote 4).


Here’s an alternative method to calculate marginal totals: 


> tab
           Treatment
Outcome     Tolbutamide Placebo
  Deaths             30      21
  Survivors         174     184
> tab2 <- cbind(tab, Total=rowSums(tab))
> rbind(tab2, Total=colSums(tab2))
          Tolbutamide Placebo Total
Deaths             30      21    51
Survivors         174     184   358
Total             204     205   409


For convenience, the addmargins function calculates and displays the marginal
totals with the original data in one step.

> addmargins(tab)
           Treatment
Outcome     Tolbutamide Placebo Sum
  Deaths             30      21  51
  Survivors         174     184 358
  Sum               204     205 409


The power of the apply function comes from our ability to pass many functions 
(including our own) to it. For practice, combine the apply function with functions 
from Table 2.8 on page 42 to conduct operations on rows and columns of a matrix. 
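
As a starting point for that practice, here is a minimal sketch of passing built-in, user-created, and anonymous functions to apply. The table is rebuilt from the UGDP counts shown above so the example runs on its own; the helper name risk and the object names col.max, risks, and row.dist are illustrative choices, not names used elsewhere in the text.

```r
# UGDP 2x2 table from the text, rebuilt so the example is self-contained
tab <- matrix(c(30, 174, 21, 184), nrow = 2,
              dimnames = list(Outcome = c("Deaths", "Survivors"),
                              Treatment = c("Tolbutamide", "Placebo")))

# A built-in function can be passed by name
col.max <- apply(tab, 2, max)       # largest cell count in each column

# risk() is a hypothetical helper: deaths (first element) over the column total
risk <- function(v) v[1] / sum(v)
risks <- apply(tab, 2, risk)        # risk of death by treatment group

# An anonymous function works too; apply returns one column per row of tab,
# so t() restores the original orientation
row.dist <- t(apply(tab, 1, function(v) v / sum(v)))

col.max
round(risks, 4)
round(row.dist, 4)
```

Note that when the function returns a vector rather than a single value, apply arranges the results column by column, which is why the row distribution has to be transposed.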


2.3.6.2 The sweep function 


4 More specifically, rowSums, colSums, rowMeans, and colMeans are optimized for speed.

The sweep function is another important and versatile function for conducting
operations across rows or columns of a matrix. This function “sweeps” (operates on)
a row or column of a matrix using some function and a value (usually derived from
the row or column values). To understand this, we consider an example involving a
single vector. For a given integer vector x, converting the values of x into
proportions involves two steps:


> x <- c(1, 2, 3, 4, 5)
> sumx <- sum(x) #Step 1: summation
> propx <- x/sumx #Step 2: division (the "sweep")
> propx
[1] 0.066667 0.133333 0.200000 0.266667 0.333333


Applying the equivalent operation across rows or columns of a matrix requires the
sweep function.

For example, to calculate the row and column distributions of a 2-way table we
combine the apply (step 1) and the sweep (step 2) functions:


> tab
           Treatment
Outcome     Tolbutamide Placebo
  Deaths             30      21
  Survivors         174     184
> rtot <- apply(tab, 1, sum) #row totals
> tab.rowdist <- sweep(tab, 1, rtot, "/")
> tab.rowdist
           Treatment
Outcome     Tolbutamide Placebo
  Deaths        0.58824 0.41176
  Survivors     0.48603 0.51397
> ctot <- apply(tab, 2, sum) #column totals
> tab.coldist <- sweep(tab, 2, ctot, "/")
> tab.coldist
           Treatment
Outcome     Tolbutamide Placebo
  Deaths        0.14706 0.10244
  Survivors     0.85294 0.89756

Because R is a true programming language, these can be combined into single steps: 


> sweep(tab, 1, apply(tab, 1, sum), "/") #row distribution 


Treatment 
Outcome Tolbutamide Placebo 
Deaths 0.58824 0.41176 
Survivors 0.48603 0.51397 
> sweep(tab, 2, apply(tab, 2, sum), "/") #column distribution 
Treatment 
Outcome Tolbutamide Placebo 
Deaths 0.14706 0.10244 
Survivors 0.85294 0.89756 
Table 2.16 Deaths among subjects who received tolbutamide and placebo in the University Group
Diabetes Program (1970), stratifying by age

            Age<55               Age>=55              Combined
            Tolbutamide Placebo  Tolbutamide Placebo  Tolbutamide Placebo
Deaths                8       5           22      16           30      21
Survivors            98     115           76      69          174     184
Total               106     120           98      85          204     205





For convenience, R provides prop.table. However, this function just uses the 
apply and sweep functions. 
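
The following minimal sketch checks that prop.table returns the same distributions as the two-step apply and sweep approach. It rebuilds the UGDP table so that it runs on its own; the object names (joint, row.dist, row.dist2, and so on) are illustrative only.

```r
# UGDP 2x2 table from the text, rebuilt so the example is self-contained
tab <- matrix(c(30, 174, 21, 184), nrow = 2,
              dimnames = list(Outcome = c("Deaths", "Survivors"),
                              Treatment = c("Tolbutamide", "Placebo")))

joint    <- prop.table(tab)      # joint distribution, same as tab / sum(tab)
row.dist <- prop.table(tab, 1)   # each row sums to 1
col.dist <- prop.table(tab, 2)   # each column sums to 1

# The two-step approach: apply for the totals, sweep for the division
row.dist2 <- sweep(tab, 1, apply(tab, 1, sum), "/")
col.dist2 <- sweep(tab, 2, apply(tab, 2, sum), "/")

all.equal(row.dist, row.dist2, check.attributes = FALSE)
all.equal(col.dist, col.dist2, check.attributes = FALSE)
round(joint, 4)
```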


2.4 An array is an n-dimensional table of like elements


2.4.1 Understanding arrays 


While a matrix is a 2-dimensional table of like elements, an array is the
generalization of matrices to n dimensions. Stratified contingency tables in
epidemiology are represented as array data objects in R. For example, the randomized
clinical trial previously shown comparing the number of deaths among diabetic
subjects that received tolbutamide vs. placebo is now also stratified by age group
(Table 2.16):

This is a 3-dimensional array: outcome status vs. treatment status vs. age group.
Let’s see how we can represent these data in R.


> tdat <- c(8, 98, 5, 115, 22, 76, 16, 69)
> tdat <- array(tdat, c(2, 2, 2))
> dimnames(tdat) <- list(Outcome=c("Deaths", "Survivors"),
+ Treatment=c("Tolbutamide", "Placebo"),
+ "Age group"=c("Age<55", "Age>=55"))
> tdat
, , Age group = Age<55

           Treatment
Outcome     Tolbutamide Placebo
  Deaths              8       5
  Survivors          98     115

, , Age group = Age>=55

           Treatment
Outcome     Tolbutamide Placebo
  Deaths             22      16
  Survivors          76      69
Table 2.17 Example of 4-dimensional array: Year 2000 population estimates by age, ethnicity,
sex, and county

                        Ethnicity
County/Sex     Age     White AfrAmer AsianPI Latino Multirace AmerInd
Alameda
  Female       <=19    58160   31765   40653  49738     10120     839
               20-44  112326   44437   72923  58553      7658    1401
               45-64   82205   24948   33236  18534      2922     822
               65+     49762   12834   16004   7548      1014     246
  Male         <=19    61446   32277   42922  53097     10102     828
               20-44  115745   36976   69053  69233      6795    1263
               45-64   81332   20737   29841  17402      2506     687
               65+     33994    8087   11855   5416       711     156
San Francisco
  Female       <=19    14355    6986   23265  13251      2940     173
               20-44   85766   10284   52479  23458      3656     526
               45-64   35617    6890   31478   9184      1144     282
               65+     27215    5172   23044   5773       554     121
  Male         <=19    14881    6959   24541  14480      2851     165
               20-44  105798   11111   48379  31605      3766     782
               45-64   43694    7352   26404   8674      1220     354
               65+     20072    3329   17190   3428       450      76





R displays the first stratum (tdat[, , 1]) then the second stratum (tdat[, , 2]).
Our goal now is to understand how to generate and operate on these types of arrays.
Before we can do this we need to thoroughly understand the structure of arrays.

Let’s study a 4-dimensional array. Displayed in Table 2.17 are the year 2000
population estimates for Alameda and San Francisco Counties by age, ethnicity, and
sex. The first dimension is age category, the second dimension is ethnicity, the third
dimension is sex, and the fourth dimension is county. Learning how to visualize this
4-dimensional structure in R will enable us to visualize arrays of any number of
dimensions.

Displayed in Figure 2.2 is a schematic representation of the 4-dimensional array of
population estimates in Table 2.17. The left cube represents the population estimates
by age, race, and sex (dimensions 1, 2, and 3) for Alameda County (first component
of dimension 4). The right cube represents the population estimates by age, race,
and sex (dimensions 1, 2, and 3) for San Francisco County (second component of
dimension 4). We see, then, that it is possible to visualize data arrays in more
than three dimensions.

Fig. 2.2 Schematic representation of a 4-dimensional array (Year 2000 population estimates by
age, race, sex, and county)

Fig. 2.3 Schematic representation of a theoretical 5-dimensional array (possibly population
estimates by age (1), race (2), sex (3), party affiliation (4), and state (5)). From this diagram,
we can infer that the field “state” has 3 levels, and the field “party affiliation” has 2 levels;
however, it is not apparent how many age levels, race levels, and sex levels have been created.
Although not displayed, age levels would be represented by row names (along 1st dimension),
race levels would be represented by column names (along 2nd dimension), and sex levels would
be represented by depth names (along 3rd dimension).

To convince ourselves further, displayed in Figure 2.3 is a theoretical 5-dimensional
data array. Suppose this 5-D array contained data on age (“Young”, “Old”), ethnicity
(“White”, “Nonwhite”), sex (“Male”, “Female”), party affiliation (“Democrat”,
“Republican”), and state (“California”, “Washington State”, “Florida”). For practice,
using fictitious data, try the following R code and study the output:


tab5 <- array(1:48, dim = c(2,2,2,2,3))
dn1 <- c("Young", "Old")
dn2 <- c("White", "Nonwhite")
dn3 <- c("Male", "Female")
dn4 <- c("Democrat", "Republican")
dn5 <- c("California", "Washington State", "Florida")
dimnames(tab5) <- list(Age=dn1, Race=dn2, Sex=dn3, Party=dn4,
                       State=dn5)
tab5
Table 2.18 Common ways of creating arrays

Function  Description                 Try these examples
array     Reshapes vector into an     aa <- array(1:12, dim = c(2, 3, 2))
          array                       aa
table     Creates n-dimensional       data(infert) # load infert data set
          contingency table from n    table(infert$educ, infert$spont,
          vectors                           infert$case)
xtabs     Creates a contingency       xtabs(~education + case + parity,
          table from                        data = infert)
          cross-classifying factors
          contained in a data frame
          using a formula interface
as.table  Creates n-dimensional       ft <- ftable(infert$educ,
          contingency table from                   infert$spont,
          n-dimensional ftable                     infert$case)
                                      ft
                                      as.table(ft)
dim       Assign dimensions to a      x <- 1:12
          data object                 x
                                      dim(x) <- c(2, 3, 2)
                                      x





2.4.2 Creating arrays 


In R, arrays are most often produced with the array, table, or xtabs functions
(Table 2.18). As in the previous example, the array function works much like the
matrix function except the array function can specify 1 or more dimensions,
and the matrix function only works with 2 dimensions.


> array(1, dim = 1)
[1] 1
> array(1, dim = c(1, 1))
     [,1]
[1,]    1
> array(1, dim = c(1, 1, 1))
, , 1

     [,1]
[1,]    1


The table function cross tabulates 2 or more categorical vectors: character vectors
or factors. In R, categorical data are represented as factors (more on this later).
In contrast, using a formula interface, the xtabs function cross tabulates 2 or more
factors from a data frame. Additionally, the xtabs function includes field names
(which is highly preferred). For illustration, we will cross tabulate character vectors.


> #read in data "as is" (no factors created)
> udat1 <- read.csv("http://www.medepi.net/data/ugdp.txt",
+ as.is = TRUE)
> str(udat1)
‘data.frame’: 409 obs. of 3 variables:
 $ Status   : chr "Death" "Death" "Death" "Death" ...
 $ Treatment: chr "Tolbutamide" "Tolbutamide" "Tolbutamide" ...
 $ Agegrp   : chr "<55" "<55" "<55" "<55" ...
> table(udat1$Status, udat1$Treatment, udat1$Agegrp)
, ,  = 55+

           Placebo Tolbutamide
  Death         16          22
  Survivor      69          76

, ,  = <55

           Placebo Tolbutamide
  Death          5           8
  Survivor     115          98


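The xtabs function will not work on character vectors.
By default, versions of R before 4.0.0 convert character fields into factors when
reading data (in R 4.0.0 and later, add stringsAsFactors = TRUE to the read.csv
call to get the same behavior). With factors, both the table and xtabs functions
cross tabulate the fields.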


> #read in data and convert character vectors to factors
> udat2 <- read.csv("http://www.medepi.net/data/ugdp.txt")
> str(udat2)
‘data.frame’: 409 obs. of 3 variables:
 $ Status   : Factor w/ 2 levels "Death","Survivor": 1 1 1 1 ...
 $ Treatment: Factor w/ 2 levels "Placebo","Tolbutamide": 2 2 ...
 $ Agegrp   : Factor w/ 2 levels "55+","<55": 2 2 2 2 2 2 2 ...
> table(udat2$Status, udat2$Treatment, udat2$Agegrp)
, ,  = 55+

           Placebo Tolbutamide
  Death         16          22
  Survivor      69          76

, ,  = <55

           Placebo Tolbutamide
  Death          5           8
  Survivor     115          98

> xtabs(~Status + Treatment + Agegrp, data = udat2)
, , Agegrp = 55+

          Treatment
Status     Placebo Tolbutamide
  Death         16          22
  Survivor      69          76

, , Agegrp = <55

          Treatment
Status     Placebo Tolbutamide
  Death          5           8
  Survivor     115          98


Notice that the xtabs function above included the field names. Field names can be
added manually with the table function:


> table(Outcome = udat2$Status, Therapy = udat2$Treatment,
+ Age = udat2$Agegrp)
, , Age = 55+ 


Therapy 
Outcome Placebo Tolbutamide 
Death 16 22 
Survivor 69 76 


, , Age = <55 


Therapy 
Outcome Placebo Tolbutamide 
Death 5 8 
Survivor 115 98 


Recall that the ftable function creates a flat contingency table from categorical
vectors. The as.table function converts the flat contingency table back into a
multidimensional array.


> ftab <- ftable(udat2$Agegrp, udat2$Treatment, udat2$Status)
> ftab
                 Death Survivor

<55 Placebo          5      115
    Tolbutamide      8       98
55+ Placebo         16       69
    Tolbutamide     22       76
> as.table(ftab)
, ,  = Death
Table 2.19 Common ways of naming arrays

Function  Try these examples
array     # name components at time array is created
          x <- c(140, 11, 280, 56, 207, 9, 275, 32)
          # create labels for values of each dimension
          rn <- c(">=1 cups per day", "0 cups per day")
          cn <- c("Cases", "Controls")
          dn <- c("Females", "Males")
          x <- array(x, dim = c(2, 2, 2), dimnames = list(Coffee = rn,
                     Outcome = cn, Gender = dn))
          x
dimnames  x <- c(140, 11, 280, 56, 207, 9, 275, 32)
          # create labels for values of each dimension
          rn <- c(">=1 cups per day", "0 cups per day")
          cn <- c("Cases", "Controls")
          dn <- c("Females", "Males")
          x <- array(x, dim = c(2, 2, 2))
          dimnames(x) <- list(Coffee = rn, Outcome = cn, Gender = dn)
          x
names     x <- c(140, 11, 280, 56, 207, 9, 275, 32)
          # create labels for values of each dimension
          rn <- c(">=1 cups per day", "0 cups per day")
          cn <- c("Cases", "Controls")
          dn <- c("Females", "Males")
          x <- array(x, dim = c(2, 2, 2))
          dimnames(x) <- list(rn, cn, dn)
          x # display w/o field names
          names(dimnames(x)) <- c("Coffee", "Case status", "Sex")
          x # display w/ field names







      Placebo Tolbutamide
  <55       5           8
  55+      16          22

, ,  = Survivor

      Placebo Tolbutamide
  <55     115          98
  55+      69          76
Table 2.20 Common ways of indexing arrays

Indexing             Try these examples
By position          # use x from Table 2.19
                     x[1, , ]
                     x[ ,2, ]
By name (if exists)  x[ , ,"Males"]
                     x[ ,"Controls","Females"]
By logical vector    zz <- x[ ,1,1]>50
                     zz
                     x[zz, , ]


Table 2.21 Common ways of replacing array elements

Replacing            Try these examples
By position          # use x from Table 2.19
                     x[1, 1, 1] <- NA
                     x
By name (if exists)  x[ ,"Controls","Females"] <- 99
                     x
By logical           x>200
                     x[x>200] <- 999
                     x


2.4.3 Naming arrays 


Naming components of an array is an extension of naming components of a matrix 
(Table 2.12 on page 54). Study and implement the examples in Table 2.19 on the 
facing page. 


2.4.4 Indexing arrays 


Indexing an array is an extension of indexing a matrix (Table 2.13 on page 56). 
Study and implement the examples in Table 2.20. 


2.4.5 Replacing array elements 


Replacing elements of an array is an extension of replacing elements of a matrix 
(Table 2.14 on page 57). Study and implement the examples in Table 2.21. 
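
To make the three indexing styles concrete, here is a minimal sketch using the age-stratified UGDP array from Section 2.4.1, rebuilt so the example runs on its own. The object names deaths, placebo.deaths, big, and tdat2 are illustrative only.

```r
# Age-stratified UGDP counts from Section 2.4.1
tdat <- array(c(8, 98, 5, 115, 22, 76, 16, 69), dim = c(2, 2, 2),
              dimnames = list(Outcome = c("Deaths", "Survivors"),
                              Treatment = c("Tolbutamide", "Placebo"),
                              "Age group" = c("Age<55", "Age>=55")))

# By position: first level of dimension 1 (a 2 x 2 matrix of deaths)
deaths <- tdat[1, , ]

# By name: deaths in the placebo group, one value per age group
placebo.deaths <- tdat["Deaths", "Placebo", ]

# By logical array: cells with more than 100 subjects
big <- tdat > 100
tdat[big]

# Replacement uses the same indexing on the left-hand side;
# a copy is modified so the original counts are kept
tdat2 <- tdat
tdat2[big] <- NA
tdat2
```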
Table 2.22 Common ways of operating on an array

Function      Description                  Try these examples
aperm         Transpose an array by        x <- array(1:24, c(2, 3, 2, 2))
              permuting its dimensions     x
              and optionally resizing it   aperm(x, c(3, 2, 1, 4))
apply         Apply a function to the      apply(x, 1, sum)
              margins of an array          apply(x, c(2, 3), sum)
                                           apply(x, c(1, 2, 4), sum)
sweep         Return an array obtained     zz <- apply(x, c(1, 2), sum)
              from an input array by       sweep(x, c(1, 2), zz, "/")
              sweeping out a summary
              statistic

The following shortcuts use the apply and/or sweep functions

margin.table  For a contingency table in   margin.table(x); sum(x)
              array form, compute the      margin.table(x, c(1, 2))
              sum of table entries for a   apply(x, c(1, 2), sum) #equiv
              given index                  margin.table(x, c(1, 2, 4))
                                           apply(x, c(1, 2, 4), sum) #equiv
rowSums       Sum across rows of an        rowSums(x) # dims = 1
              array                        apply(x, 1, sum) #equiv
                                           rowSums(x, dims = 2)
                                           apply(x, c(1, 2), sum) #equiv
colSums       Sum down columns of an       colSums(x) # dims = 1
              array                        apply(x, c(2, 3, 4), sum) #equiv
                                           colSums(x, dims = 2)
                                           apply(x, c(3, 4), sum) #equiv
addmargins    Calculate and display        addmargins(x)
              marginal totals of an array
rowMeans      Calculate means across       rowMeans(x) # dims = 1
              rows of an array             apply(x, 1, mean) #equiv
                                           rowMeans(x, dims = 2)
                                           apply(x, c(1, 2), mean) #equiv
colMeans      Calculate means down         colMeans(x) # dims = 1
              columns of an array          apply(x, c(2, 3, 4), mean) #equiv
                                           colMeans(x, dims = 2)
                                           apply(x, c(3, 4), mean) #equiv
prop.table    Generates distribution for   prop.table(x)
              dimensions that are          prop.table(x, margin = 1)
              summed in the                prop.table(x, c(1, 2))
              margin.table function        prop.table(x, c(1, 2, 3))





2.4.6 Operations on arrays 


With the exception of the aperm function, operating on an array (Table 2.22) is an
extension of operating on a matrix (Table 2.15 on page 58). Consider the number
of primary and secondary syphilis cases in the United States, 1989, stratified by sex,
ethnicity, and age (Table 2.23). This table contains the marginal and joint
distribution of cases. Let’s read in the original data and reproduce the table results.

Table 2.23 Example of 3-dimensional array with marginal totals: Primary and secondary syphilis
morbidity by age, race, and sex, United States, 1989

                              Ethnicity
Age (years)       Sex      White  Black  Other  Total
<=14              Male         2     31      7     40
                  Female      14    165     11    190
                  Total       16    196     18    230
15-19             Male        88   1412    210   1710
                  Female     253   2257    158   2668
                  Total      341   3669    368   4378
20-24             Male       407   4059    654   5120
                  Female     475   4503    307   5285
                  Total      882   8562    961  10405
25-29             Male       550   4121    633   5304
                  Female     433   3590    283   4306
                  Total      983   7711    916   9610
30-34             Male       564   4453    520   5537
                  Female     316   2628    167   3111
                  Total      880   7081    687   8648
35-44             Male       654   3858    492   5004
                  Female     243   1505    149   1897
                  Total      897   5363    641   6901
45-54             Male       323   1619    202   2144
                  Female      55    392     40    487
                  Total      378   2011    242   2631
55+               Male       216    823    108   1147
                  Female      24     92     15    131
                  Total      240    915    123   1278
Total (all ages)  Male      2804  20376   2826  26006
                  Female    1813  15132   1130  18075
                  Total     4617  35508   3956  44081


> sdat3 <- read.csv("http://www.medepi.net/data/syphilis89c.txt")
> str(sdat3)
‘data.frame’: 44081 obs. of 3 variables:
 $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
 $ Race: Factor w/ 3 levels "White","Black",..: 1 1 1 1 1 1 ...
 $ Age : Factor w/ 8 levels "<=14","15-19",..: 1 1 2 2 2 2 2 ...
> sdat3[1:5,] #display first 5 lines
   Sex  Race   Age
1 Male White  <=14
2 Male White  <=14
3 Male White 15-19
4 Male White 15-19
5 Male White 15-19
> sdat <- xtabs(~Sex+Race+Age, data=sdat3) #create array
> sdat
, , Age = <=14

        Race
Sex      White Black Other
  Male       2    31     7
  Female    14   165    11

, , Age = 15-19

        Race
Sex      White Black Other
  Male      88  1412   210
  Female   253  2257   158

, , Age = 20-24

        Race
Sex      White Black Other
  Male     407  4059   654
  Female   475  4503   307

, , Age = 25-29

        Race
Sex      White Black Other
  Male     550  4121   633
  Female   433  3590   283

, , Age = 30-34

        Race
Sex      White Black Other
  Male     564  4453   520
  Female   316  2628   167

, , Age = 35-44

        Race
Sex      White Black Other
  Male     654  3858   492
  Female   243  1505   149

, , Age = 45-54

        Race
Sex      White Black Other
  Male     323  1619   202
  Female    55   392    40

, , Age = 55+

        Race
Sex      White Black Other
  Male     216   823   108
  Female    24    92    15

To get marginal totals for one dimension, use the apply function and specify 
the dimension for stratifying the results. 


> sum(sdat) #total 
[1] 44081 
> apply(X = sdat, MARGIN = 1, FUN = sum) #by sex 
Male Female 
26006 18075 
> apply(sdat, 2, sum) #by race 
White Black Other 
4617 35508 3956 
> apply(sdat, 3, sum) #by age 
<=14 15-19 20-24 25-29 30-34 35-44 45-54 55+ 
230 4378 10405 9610 8648 6901 2631 1278 








To get the joint marginal totals for 2 or more dimensions, use the apply function
and specify the dimensions for stratifying the results. This means that the function
that is passed to apply is applied across the other, non-stratified dimensions.


> apply(sdat, c(1, 2), sum) #by sex and race
        Race
Sex      White Black Other
  Male    2804 20376  2826
  Female  1813 15132  1130
> apply(sdat, c(1, 3), sum) #by sex and age
        Age
Sex      <=14 15-19 20-24 25-29 30-34 35-44 45-54  55+
  Male     40  1710  5120  5304  5537  5004  2144 1147
  Female  190  2668  5285  4306  3111  1897   487  131
> apply(sdat, c(3, 2), sum) #by age and race
       Race
Age     White Black Other
  <=14     16   196    18
  15-19   341  3669   368
  20-24   882  8562   961
  25-29   983  7711   916
  30-34   880  7081   687
  35-44   897  5363   641
  45-54   378  2011   242
  55+     240   915   123
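
The same logic can be checked on a small array that does not require downloading the syphilis data. The sketch below uses fictitious counts (the integers 1 to 24) and illustrative level names; it confirms that the dimensions named in MARGIN are kept while the remaining dimensions are summed over, and that margin.table gives the same totals.

```r
# Fictitious 2 x 3 x 4 array; the counts and age labels are made up
x <- array(1:24, dim = c(2, 3, 4),
           dimnames = list(Sex = c("Male", "Female"),
                           Race = c("White", "Black", "Other"),
                           Age = c("A1", "A2", "A3", "A4")))

by.sex.race <- apply(x, c(1, 2), sum)   # keeps Sex and Race, sums over Age
by.age      <- apply(x, 3, sum)         # keeps Age, sums over Sex and Race

dim(by.sex.race)
by.sex.race
by.age

# margin.table is the shortcut for the same operation
all.equal(by.sex.race, margin.table(x, c(1, 2)), check.attributes = FALSE)

# Every set of marginal totals adds up to the grand total
sum(by.sex.race) == sum(x)
sum(by.age) == sum(x)
```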


In R, arrays are displayed by the 1st and 2nd dimensions, stratified by the
remaining dimensions. To change the order of the dimensions, and hence the display,
use the aperm function. For example, the syphilis case data is most efficiently
displayed when it is stratified by race, age, and sex:


> sdat.ras <- aperm(sdat, c(2, 3, 1))
> sdat.ras
, , Sex = Male

       Age
Race    <=14 15-19 20-24 25-29 30-34 35-44 45-54 55+
  White    2    88   407   550   564   654   323 216
  Black   31  1412  4059  4121  4453  3858  1619 823
  Other    7   210   654   633   520   492   202 108

, , Sex = Female

       Age
Race    <=14 15-19 20-24 25-29 30-34 35-44 45-54 55+
  White   14   253   475   433   316   243    55  24
  Black  165  2257  4503  3590  2628  1505   392  92
  Other   11   158   307   283   167   149    40  15
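
The second argument to aperm lists the old dimension numbers in their new order. A minimal sketch with a fictitious array (made-up counts and age labels) shows that only the arrangement changes: each cell keeps its value and its labels.

```r
# Fictitious 2 x 3 x 4 array with dimensions Sex, Race, Age
x <- array(1:24, dim = c(2, 3, 4),
           dimnames = list(Sex = c("Male", "Female"),
                           Race = c("White", "Black", "Other"),
                           Age = c("A1", "A2", "A3", "A4")))

# Old dimensions 2, 3, 1 become new dimensions 1, 2, 3
y <- aperm(x, c(2, 3, 1))

dim(x)                  # Sex, Race, Age
dim(y)                  # Race, Age, Sex
names(dimnames(y))

# The same cell, addressed by name in each arrangement
x["Female", "Black", "A3"]
y["Black", "A3", "Female"]
```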


Another method for changing the display of an array is to convert it into a flat
contingency table using the ftable function. For example, to display Table 2.23
on page 71 as a flat contingency table in R (but without the marginal totals), we use
the following code:


> sdat.asr <- aperm(sdat, c(3,1,2)) #rearrange to age, sex, race
> ftable(sdat.asr) #convert to 2-D flat table
             Race White Black Other
Age   Sex
<=14  Male            2    31     7
      Female         14   165    11
15-19 Male           88  1412   210
      Female        253  2257   158
20-24 Male          407  4059   654
      Female        475  4503   307
25-29 Male          550  4121   633
      Female        433  3590   283
30-34 Male          564  4453   520
      Female        316  2628   167
35-44 Male          654  3858   492
      Female        243  1505   149
45-54 Male          323  1619   202
      Female         55   392    40
55+   Male          216   823   108
      Female         24    92    15


This ftable object can be treated as a matrix, but it cannot be transposed. Notice
that we can combine ftable with addmargins:


> ftable(addmargins(sdat.asr))
             Race Black Other White   Sum
Age   Sex
15-19 Female       2257   158   253  2668
      Male         1412   210    88  1710
      Sum          3669   368   341  4378
20-24 Female       4503   307   475  5285
      Male         4059   654   407  5120
      Sum          8562   961   882 10405
25-29 Female       3590   283   433  4306
      Male         4121   633   550  5304
      Sum          7711   916   983  9610
30-34 Female       2628   167   316  3111
      Male         4453   520   564  5537
      Sum          7081   687   880  8648
35-44 Female       1505   149   243  1897
      Male         3858   492   654  5004
      Sum          5363   641   897  6901
45-54 Female        392    40    55   487
      Male         1619   202   323  2144
      Sum          2011   242   378  2631
<=14  Female        165    11    14   190
      Male           31     7     2    40
      Sum           196    18    16   230
>55   Female         92    15    24   131
      Male          823   108   216  1147
      Sum           915   123   240  1278
Sum   Female      15132  1130  1813 18075
      Male        20376  2826  2804 26006
      Sum         35508  3956  4617 44081


To share the U.S. syphilis data in a universal format, we could create a text file
with the data in a tabular form. However, the original, individual-level data set has
over 40,000 observations. Instead, it would be more convenient to create a
group-level, tabular data set using the as.data.frame function on the data array object.


> sdat.df <- as.data.frame(sdat)
> str(sdat.df)
‘data.frame’: 48 obs. of 4 variables:
 $ Sex : Factor w/ 2 levels "Male","Female": 1 2 1 2 1 2 1 2 1 2 ...
 $ Race: Factor w/ 3 levels "White","Black",..: 1 1 2 2 3 3 1 1 2 2 ...
 $ Age : Factor w/ 8 levels "<=14","15-19",..: 1 1 1 1 1 1 2 2 2 2 ...
 $ Freq: num 2 14 31 165 7 ...
> sdat.df[1:8,]
     Sex  Race   Age Freq
1   Male White  <=14    2
2 Female White  <=14   14
3   Male Black  <=14   31
4 Female Black  <=14  165
5   Male Other  <=14    7
6 Female Other  <=14   11
7   Male White 15-19   88
8 Female White 15-19  253



For additional practice, study and implement the examples in Table 2.22 on 
page 70. 
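
The conversion of a contingency array to a group-level data frame, described above for the syphilis data, can also be reversed. The following minimal sketch uses the small UGDP array instead of the syphilis data so that it runs without a download; the dimension name Age and the object names tdat.df and back are illustrative. It assumes the array has class table (as objects from table and xtabs do), because as.data.frame produces the one-row-per-cell layout only for tables.

```r
# Age-stratified UGDP counts, stored as a table object
tdat <- as.table(array(c(8, 98, 5, 115, 22, 76, 16, 69), dim = c(2, 2, 2),
                       dimnames = list(Outcome = c("Deaths", "Survivors"),
                                       Treatment = c("Tolbutamide", "Placebo"),
                                       Age = c("<55", ">=55"))))

# Array to group-level data frame: one row per cell, counts in Freq
tdat.df <- as.data.frame(tdat)
tdat.df

# Group-level data frame back to array: put the count column on the
# left-hand side of the xtabs formula
back <- xtabs(Freq ~ Outcome + Treatment + Age, data = tdat.df)
all(back == tdat)
```

A data frame like tdat.df has only as many rows as the array has cells, so it is a compact way to store or share tabulated data.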


2.5 A list is a collection of like or unlike data objects 
2.5.1 Understanding lists 


Up to now, we have been working with atomic data objects (vector, matrix, array).
In contrast, lists, data frames, and functions are recursive data objects. Recursive
data objects have more flexibility in combining diverse data objects into one object.
A list provides the most flexibility. Think of a list object as a collection of “bins”
that can contain any R object (see Figure 2.4). Lists are very useful for collecting
results of an analysis or a function into one data object where all its contents are
readily accessible by indexing.

For example, using the UGDP clinical trial data, suppose we perform Fisher’s 
exact test for testing the null hypothesis of independence of rows and columns in a 
contingency table with fixed marginals. 


Fig. 2.4 Schematic representation of a list of length four. The first bin [1] contains a smiling
face [[1]], the second bin [2] contains a flower [[2]], the third bin [3] contains a lightning
bolt [[3]], and the fourth bin [4] contains a heart [[4]]. When indexing a list object, single
brackets [·] index the bin, and double brackets [[·]] index the bin contents. If the bin has a
name, then $name also indexes the contents.

> udat <- read.csv("http://www.medepi.net/data/ugdp.txt")
> tab <- table(udat$Status, udat$Treatment)[,2:1]
> tab

           Tolbutamide Placebo
  Death             30      21
  Survivor         174     184
> ftab <- fisher.test(tab)
> ftab

        Fisher’s Exact Test for Count Data

data: tab
p-value = 0.1813
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.80138 2.88729
sample estimates:
odds ratio
    1.5091


The default display only shows partial results. The total results are stored in the
object ftab. Let’s evaluate the structure of ftab and extract some results:


> str(ftab)
List of 7
 $ p.value    : num 0.181
 $ conf.int   : atomic [1:2] 0.801 2.887
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num 1.51
  ..- attr(*, "names")= chr "odds ratio"
 $ null.value : Named num 1
  ..- attr(*, "names")= chr "odds ratio"
 $ alternative: chr "two.sided"
 $ method     : chr "Fisher’s Exact Test for Count Data"
 $ data.name  : chr "tab"
 - attr(*, "class")= chr "htest"
> ftab$estimate
odds ratio
    1.5091
> ftab$conf.int
[1] 0.80138 2.88729
attr(,"conf.level")
[1] 0.95
> ftab$conf.int[2]
[1] 2.887286
> ftab$p.value
[1] 0.18126


Using the str function to evaluate the structure of an output object is a common
method employed to extract additional results for display or further analysis. In this
case, ftab was a list with 7 bins, each with a name.
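
As a minimal sketch of this workflow, the following code rebuilds the UGDP table directly from its counts (so no download is needed), inspects the names of the bins, and collects selected results into a single named vector. The object name res and its element names are illustrative only.

```r
# UGDP 2x2 table entered directly from the counts shown in the text
tab <- matrix(c(30, 174, 21, 184), nrow = 2,
              dimnames = list(c("Death", "Survivor"),
                              c("Tolbutamide", "Placebo")))
ftab <- fisher.test(tab)

names(ftab)             # names of the bins
ftab$p.value            # index bin contents by name
ftab[["estimate"]]      # double brackets with a name work the same way

# Collect selected results for display or further analysis
res <- c(odds.ratio = unname(ftab$estimate),
         lower = ftab$conf.int[1],
         upper = ftab$conf.int[2],
         p.value = ftab$p.value)
round(res, 3)
```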


2.5.2 Creating lists 


To create a list directly, use the list function. A list is a convenient method to save
results in our customized functions. For example, here’s a function to calculate an
odds ratio from a 2 x 2 table:


orcalc <- function(x) {
  or <- (x[1,1]*x[2,2])/(x[1,2]*x[2,1])
  pval <- fisher.test(x)$p.value
  list(data = x, odds.ratio = or, p.value = pval)
}


The orcalc function has been loaded in R, and now we run the function on the 
UGDP data. 


> tab #display 2x2 table

           Tolbutamide Placebo
  Death             30      21
  Survivor         174     184
> orcalc(tab) #run function
$data

           Tolbutamide Placebo
  Death             30      21
Table 2.24 Common ways of creating a list

Function       Description           Try these examples
list           Creates list object   x <- 1:3
                                     y <- matrix(c("a", "c", "b", "d"), 2, 2)
                                     z <- c("Pedro", "Paulo", "Maria")
                                     mm <- list(x, y, z)
                                     mm
data.frame     List in tabular       x <- data.frame(id=1:3, sex=c("M","F","T"))
               format where each     x
               “bin” has a vector    mode(x)
               of same length        class(x)
as.data.frame  Coerces data          x <- matrix(1:6, 2, 3)
               object into a data    x
               frame                 y <- as.data.frame(x)
                                     y
read.table     Reads ASCII text      wcgs <- read.table(".../wcgs.txt",
read.csv       file into data               header=TRUE, sep=",")
read.delim     frame object (1)      str(wcgs)
read.fwf
vector         Creates empty list    vector("list", 2)
               of length n
as.list        Coercion into list    list(1:2) # compare to as.list
               object                as.list(1:2)

1. Try read.table("http://www.medepi.net/data/wcgs.txt", header=TRUE,



  Survivor         174     184

$odds.ratio
[1] 1.5107

$p.value
[1] 0.18126


For additional practice, study and implement the examples in Table 2.24. 


2.5.3 Naming lists 


Components (bins) of a list can be unnamed or named. Components of a list can be 
named at the time the list is created or later using the names function. For practice, 
try the examples in Table 2.25 on the following page. 
Table 2.25 Common ways of naming lists

Function  Try these examples
names     z <- list(rnorm(20), "Luis", 1:3)
          z
          # name after creation of list
          names(z) <- c("bin1", "bin2", "bin3")
          z
          # name at creation of list
          z <- list(bin1 = rnorm(20), bin2 = "Luis", bin3 = 1:3)
          z
          # without assignment returns character vector
          names(z)


Table 2.26 Common ways of indexing lists

Indexing             Try these examples
By position          z <- list(bin1 = rnorm(20), bin2 = "Luis", bin3 = 1:3)
                     z[1]   # indexes "bin" #1
                     z[[1]] # indexes contents of "bin" #1
By name (if exists)  z$bin1
                     z$bin2
Indexing by          num <- sapply(z, is.numeric)
logical vector       num
                     z[num]





2.5.4 Indexing lists 


If list components (bins) are unnamed, we can index the list by bin position with
single or double brackets. Single brackets [·] index one or more bins, and double
brackets [[·]] index the contents of a single bin only.


> mylist1 <- list(1:5, matrix(1:4,2,2), c("Juan Nieve",
+ "Guillermo Farro"))
> mylist1[c(1, 3)] #index bins 1 and 3
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] "Juan Nieve"      "Guillermo Farro"

> mylist1[[3]] #index contents of 3rd bin
[1] "Juan Nieve"      "Guillermo Farro"
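
The practical difference between the two kinds of brackets is the class of what comes back: single brackets always return a list, while double brackets return whatever object is stored in the bin. A minimal sketch, using a small list with made-up contents:

```r
mylist <- list(1:5, matrix(1:4, 2, 2), c("a", "b"))

one.bin  <- mylist[1]      # a list of length 1 (the bin itself)
contents <- mylist[[1]]    # the integer vector stored in the bin

class(one.bin)
class(contents)

# Single brackets can select several bins at once
length(mylist[c(1, 3)])

# Double brackets can be followed by further indexing of the contents
mylist[[2]][1, 2]          # row 1, column 2 of the matrix in bin 2
```

Arithmetic such as sum(mylist[[1]]) works on the contents, whereas sum(mylist[1]) fails because a list is not a numeric vector.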


When list bins are named, we can index the bin contents by name. Using the
matched case-control study infert data set, we will conduct a conditional logistic
regression analysis to determine if spontaneous and induced abortions are
independently associated with infertility. For this we’ll need to load the survival
package, which contains the clogit function.














> data(infert)
> library(survival)
> mod1 <- clogit(case ~ spontaneous + induced + strata(stratum),
+ data = infert)
> mod1 #default display
Call:
clogit(case ~ spontaneous + induced + strata(stratum), data = infert)

            coef exp(coef) se(coef)    z       p
spontaneous 1.99      7.29    0.352 5.63 1.8e-08
induced     1.41      4.09    0.361 3.91 9.4e-05

Likelihood ratio test=53.1 on 2 df, p=2.87e-12 n= 248
> str(mod1) #evaluate structure
List of 17
 $ coefficients : Named num [1:2] 1.99 1.41
  ..- attr(*, "names")= chr [1:2] "spontaneous" "induced"
 $ var          : num [1:2, 1:2] 0.1242 0.0927 0.0927 0.1301
 $ loglik       : num [1:2] -90.8 -64.2
 $ score        : num 48.4
 $ iter         : int 5
> names(mod1) #names of list components
 [1] "coefficients" "var"       "loglik"
 [4] "score"        "iter"      "linear.predictors"
 [7] "residuals"    "means"     "method"
[10] "n"            "terms"     "assign"
[13] "wald.test"    "y"         "formula"
[16] "call"         "userCall"
> mod1$coeff
spontaneous     induced
     1.9859      1.4090

The results from str(mod1) are only partially displayed. Sometimes it is more
convenient to display the names of the list components rather than the complete structure.
Additionally, the summary function applied to a regression model object creates a
list object with more detailed results. This too has a default display, or we can index
list components by name.


> summod1 <- summary(mod1)
> summod1 #default display of more detailed results
Call:
coxph(formula = Surv(rep(1, 248), case) ~ spontaneous +
    induced + strata(stratum), data = infert, method = "exact")

  n= 248
            coef exp(coef) se(coef)    z       p
spontaneous 1.99      7.29    0.352 5.63 1.8e-08
induced     1.41      4.09    0.361 3.91 9.4e-05

            exp(coef) exp(-coef) lower .95 upper .95
spontaneous      7.29      0.137      3.65      14.5
induced          4.09      0.244      2.02       8.3

Rsquare= 0.193 (max possible= 0.519 )
Likelihood ratio test= 53.1 on 2 df, p=2.87e-12
Wald test            = 31.8 on 2 df, p=1.22e-07
Score (logrank) test = 48.4 on 2 df, p=3.03e-11
> names(summod1) #names of list components
 [1] "call"        "fail"        "na.action"   "n"           "icc"
 [6] "coef"        "conf.int"    "logtest"     "sctest"      "rsq"
[11] "waldtest"    "used.robust"
> summod1$coef
              coef exp(coef) se(coef)      z       p
spontaneous 1.9859    7.2854  0.35244 5.6346 1.8e-08
induced     1.4090    4.0919  0.36071 3.9062 9.4e-05


Table 2.27 Common ways of replacing list components

Replacing            Try these examples

By position          z <- list(bin1 = rnorm(20), bin2 = "Luis", bin3 = 1:3)
                     z[1] <- list(c(2, 3, 4))   # replaces "bin" contents
                     z[[1]] <- c(2, 3, 4)       # replaces "bin" contents
By name (if exists)  z$bin2 <- c("Tomas", "Luis", "Angela")
                     z
                     # replace name of specific "bin"
                     names(z)[2] <- "mykids"
                     z
By logical           num <- sapply(z, is.numeric)
                     num
                     z[num] <- list(rnorm(10), rnorm(10))
                     z
Table 2.28 Common ways of operating on a list

Function  Description                           Try these examples

lapply    Applies a function to each            x <- list(1:5, 6:10)
          component of a list and returns       x
          a list                                lapply(x, mean)
sapply    Applies a function to each            sapply(x, mean)
          component of a list and simplifies
do.call   Calls and applies a function to       do.call(rbind, x)
          the list
mapply    Applies a function to the first       x <- list(1:4, 1:4)
          elements of each argument, the        y <- list(4, rep(4, 4))
          second elements, the third            mapply(rep, x, y, SIMPLIFY=FALSE)
          elements, and so on.                  mapply(rep, x, y)


2.5.5 Replacing list components


Replacing list components is accomplished by combining indexing with assignment. 
And of course, we can index by position, name, or logical. Remember, if it can be 
indexed, it can be replaced. Study and practice the examples in Table 2.27 on the 
preceding page. 
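As a quick check of the rule "if it can be indexed, it can be replaced," the following sketch replaces list components by position, by name, and by logical vector. The list z and its contents are made up for illustration and are not taken from a data set.

```r
# Hypothetical list used only for illustration
z <- list(bin1 = c(10, 20), bin2 = "Luis", bin3 = 1:3)

z[[1]] <- c(2, 3, 4)            # by position: replace contents of bin 1
z$bin2 <- c("Tomas", "Luis")    # by name: replace contents of bin2
names(z)[2] <- "mykids"         # rename the second bin

num <- sapply(z, is.numeric)    # logical vector: TRUE FALSE TRUE
z[num] <- list(0, 0)            # by logical: replace both numeric bins
z
```

Note that single brackets on the left-hand side expect a list of replacement bins, whereas double brackets expect the new contents themselves.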


2.5.6 Operations on lists 


Because lists can have complex structural components, there are not many operations
we will want to do on lists. When we want to apply a function to each component
(bin) of a list, we use the lapply or sapply function. These functions are
identical except that sapply "simplifies" the final result, if possible.

The do.call function applies a function to the entire list using each component
as an argument. For example, consider a list where each bin contains a vector
and we want to cbind the vectors.


> mylist <- list(vec1=1:5, vec2=6:10, vec3=11:15)
> cbind(mylist) #will not work
     mylist
vec1 Integer,5
vec2 Integer,5
vec3 Integer,5
> do.call(cbind, mylist) #works
     vec1 vec2 vec3
[1,]    1    6   11
[2,]    2    7   12
[3,]    3    8   13
[4,]    4    9   14
[5,]    5   10   15


For additional practice, study and implement the examples in Table 2.28 on the
preceding page.
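To make the differences among lapply, sapply, and do.call concrete, here is a small sketch. The list x and the result names are illustrative only.

```r
# Hypothetical list of two integer vectors
x <- list(a = 1:5, b = 6:10)

res.l <- lapply(x, mean)      # returns a list: $a is 3, $b is 8
res.s <- sapply(x, mean)      # returns a named numeric vector
res.r <- do.call(rbind, x)    # same as calling rbind(a = 1:5, b = 6:10)

is.list(res.l)                # TRUE
res.s["b"]                    # 8
dim(res.r)                    # 2 5
```

lapply and sapply apply the function once per bin, whereas do.call calls the function once, passing every bin as a separate argument.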


2.6 A data frame is a list in a 2-dimensional tabular form 


A data frame is a list in 2-dimensional tabular form. Each list component (bin) is a 
data field of equal length. A data frame is a list that behaves like a matrix. Anything 
that can be done with lists can be done with data frames. Many things that can be 
done with matrices can be done with data frames. 
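The dual nature of a data frame can be verified directly. The sketch below uses a small made-up data frame (df, with hypothetical fields id and wt) and indexes it first like a list and then like a matrix.

```r
# Hypothetical data frame used only for illustration
df <- data.frame(id = 1:3, wt = c(59.5, 61.4, 45.2))

is.list(df)        # TRUE: a data frame is a list
mode(df)           # "list"
df[[2]]            # list-style indexing: contents of the 2nd bin
df$wt              # list-style indexing by name
df[2, "wt"]        # matrix-style indexing: row 2, column "wt"
dim(df)            # 3 2, as for a matrix
sapply(df, mean)   # list functions work on the columns
```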


2.6.1 Understanding data frames and factors 


Epidemiologists are familiar with tabular data sets where each row is a record and 
each column is a field. A record can be data collected on individuals or groups. 
We usually refer to the field name as a variable (e.g., age, gender, ethnicity). Fields 
can contain numeric or character data. In R, these types of data sets are handled 
by data frames. Each column of a data frame is usually either a factor or numeric 
vector, although it can have complex, character, or logical vectors. Data frames have 
the functionality of matrices and lists. For example, here are the first 10 rows of the
infert data set, a matched case-control study published in 1976 that evaluated
whether infertility was associated with prior spontaneous or induced abortions.


> data(infert)
> str(infert)
'data.frame': 248 obs. of 8 variables:
 $ education     : Factor w/ 3 levels "0-5yrs",..: 1 1
 $ age           : num NA 45 NA 23 35 36 23 32 21 28
 $ parity        : num 6 1 6 4 3 4 1 2 1 2
 $ induced       : num 1 1 2 2 1 2 0 0 0 0
 $ case          : num 1 1 1 1 1 1 1 1 1 1
 $ spontaneous   : num 2 0 0 0 1 1 0 1 1 0
 $ stratum       : int 1 2 3 4 5 6 7 8 9 10
 $ pooled.stratum: num 3 1 4 2 32 36 6 22 5 19

> infert[1:10, 1:6]
   education age parity induced case spontaneous
1     0-5yrs  NA      6       1    1           2
2     0-5yrs  45      1       1    1           0
3     0-5yrs  NA      6       2    1           0
4     0-5yrs  23      4       2    1           0
5    6-11yrs  35      3       1    1           1
6    6-11yrs  36      4       2    1           1
7    6-11yrs  23      1       0    1           0
8    6-11yrs  32      2       0    1           1
9    6-11yrs  21      1       0    1           1
10   6-11yrs  28      2       0    1           0

The fields are obviously vectors. Let’s explore a few of these vectors to see what 
we can learn about their structure in R. 


> #age variable
> infert$age
  [1] 26 42 39 34 35 36 23 32 21 28 29 37 31 29 31 27 30 26
...
[235] 25 32 25 31 38 26 31 31 25 31 34 35 29 23
> mode(infert$age)
[1] "numeric"
> class(infert$age)
[1] "numeric"

> #stratum variable
> infert$stratum
  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
...
[235] 70 71 72 73 74 75 76 77 78 79 80 81 82 83
> mode(infert$stratum)
[1] "numeric"
> class(infert$stratum)
[1] "integer"

> #education variable
> infert$education
  [1] 0-5yrs  0-5yrs  0-5yrs  0-5yrs  6-11yrs 6-11yrs
...
[247] 12+ yrs 12+ yrs
Levels: 0-5yrs 6-11yrs 12+ yrs
> mode(infert$education)
[1] "numeric"
> class(infert$education)
[1] "factor"


What have we learned so far? In the infert data frame, age is a vector of mode
“numeric” and class “numeric,” stratum is a vector of mode “numeric” and class
“integer,” and education is a vector of mode “numeric” and class “factor.” The
numeric vectors are straightforward and easy to understand. However, a factor, R’s
representation of categorical data, is a bit more complicated.

Contrary to intuition, a factor is a numeric vector, not a character vector, although
it may have been created from a character vector (shown later). Internally it stores
integer codes, and the category labels are kept in a “levels” attribute. To see the
“true” education factor use the unclass function:


> z <- unclass(infert$education)
> z
  [1] 1 1 1 1 2 2 ...
...
[244] 3 3 3 3 3
attr(,"levels")
[1] "0-5yrs"  "6-11yrs" "12+ yrs"
> mode(z)
[1] "numeric"
> class(z)
[1] "integer"


Now let’s create a factor from a character vector and then unclass it: 














> cointoss <- sample(c("Head","Tail"), 100, replace = TRUE)
> cointoss
  [1] "Tail" "Head" "Head" "Tail" "Tail" "Tail" "Head"
...
 [99] "Tail" "Head"
> fct <- factor(cointoss)
> fct
  [1] Tail Head Head Tail Tail Tail Head Head Head Tail Head
...
[100] Head
Levels: Head Tail
> unclass(fct)
  [1] 2 1 1 2 2 2 1 1 1 2 1 ...
...
attr(,"levels")
[1] "Head" "Tail"


Notice that we can still recover the original character vector using the as.character
function:


> as.character(cointoss)
  [1] "Tail" "Head" "Head" "Tail" "Tail" "Tail" "Head"
...
 [99] "Tail" "Head"
> as.character(fct)
  [1] "Tail" "Head" "Head" "Tail" "Tail" "Tail" "Head"
...
 [99] "Tail" "Head"


Okay, let’s create an ordered factor; that is, levels of a categorical variable that 
have natural ordering. For this we set ordered=TRUE in the factor function: 





> samp <- sample(c("Low", "Medium","High"), 100, replace=TRUE)
> ofac1 <- factor(samp, ordered=T)
> ofac1
  [1] Low    Medium High   Medium Medium Medium Medium
...
 [99] High   High
Levels: High < Low < Medium
> table(ofac1) #levels and labels not in natural order
ofac1
  High    Low Medium
    43     25     32


However, notice that the levels were ordered alphabetically, which is not what
we want. To change this, use the levels option in the factor function:





> ofac2 <- factor(samp, levels=c("Low", "Medium", "High"),
+ ordered=T)
> ofac2
  [1] Low    Medium High   Medium Medium Medium Medium
...
 [99] High   High
Levels: Low < Medium < High
> table(ofac2)
ofac2
   Low Medium   High
    25     32     43


Great — this is exactly what we want! For review, Table 2.29 on the next page 
summarizes the variable types in epidemiology and how they are represented in R. 
Factors (unordered and ordered) are used to represent nominal and ordinal categor- 
ical data. The infert data set contains nominal factors and the esoph data set 
contains ordinal factors. 
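The practical consequence of level ordering can be seen in a short sketch. The vector resp and the object names f.nom and f.ord are made up for illustration; unlike the random samples above, the values are fixed so the results are reproducible.

```r
# Hypothetical responses used only for illustration
resp <- c("Low", "High", "Medium", "Low")

f.nom <- factor(resp)    # unordered; levels sorted alphabetically
f.ord <- factor(resp, levels = c("Low", "Medium", "High"), ordered = TRUE)

levels(f.nom)            # "High" "Low" "Medium"
as.integer(f.nom)        # 2 1 3 2 (codes follow alphabetical levels)
as.integer(f.ord)        # 1 3 2 1 (codes follow the natural ordering)
f.ord < "High"           # TRUE FALSE TRUE TRUE; comparisons need an ordered factor
```

The same comparison on f.nom returns NA with a warning, because R has no ordering to compare against.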


2.6.2 Creating data frames 


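In the creation of data frames, character vectors (usually representing categorical
data) are converted to factors (mode numeric, class factor), and numeric vectors
remain numeric vectors of class numeric or class integer. Note that this automatic
conversion of character vectors is the default only in versions of R before 4.0.0;
in later versions, set stringsAsFactors = TRUE in the data.frame function to obtain it.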


wt <- c(59.5, 61.4, 45.2)
age <- c(11, 9, 6)
sex <- c("Male", "Male", "Female")
df <- data.frame(age, sex, wt)
df
str(df)


Table 2.29 Variable types in epidemiologic data and their representations in R data frames

               Representations in data   Representations in R
Variable type  Examples                  Mode      Class            Examples (1)

Numeric
  Continuous   3.45, 2/3                 numeric   numeric          infert$age
  Discrete     1, 2, 3, 4, ...           numeric   integer          infert$stratum
Categorical
  Nominal      male vs. female           numeric   factor           infert$education
  Ordinal      low < medium < high       numeric   ordered factor   esoph$agegp

1. First load data: data(infert); data(esoph)


Table 2.30 Common ways of creating data frames

Function       Description                 Try these examples

data.frame     Data frames are of          x <- data.frame(id=1:2, sex=c("M","F"))
               mode list                   mode(x); x
as.data.frame  Coerces data object         x <- matrix(1:6, 2, 3); x
               into a data frame           as.data.frame(x)
as.table       Combine with                x <- array(1:8, c(2, 2, 2))
ftable         as.data.frame to            dimnames(x) <- list(Exposure = c("Y", "N"),
               convert a fully               Disease = c("Y", "N"),
               labeled array into            Confounder = c("Y", "N"))
               a data frame                as.data.frame(ftable(x))
                                           as.data.frame(as.table(x))
read.table     Reads ASCII text            wcgs <- read.csv(".../wcgs.txt", header=T)
read.csv       file into data              str(wcgs)
read.delim     frame object (1)
read.fwf

1. Try read.csv("http://www.medepi.net/data/wcgs.txt", header=TRUE)


Factors can also be created directly from vectors as described in the previous section. 
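Because the default handling of character vectors in data.frame depends on the R version (the default changed in R 4.0.0), it can be safer to state the choice explicitly. The sketch below reuses the small age, sex, and wt vectors; the names df1 and df2 are illustrative.

```r
wt  <- c(59.5, 61.4, 45.2)
age <- c(11, 9, 6)
sex <- c("Male", "Male", "Female")

# Explicitly request conversion of character vectors to factors
df1 <- data.frame(age, sex, wt, stringsAsFactors = TRUE)
class(df1$sex)     # "factor"
levels(df1$sex)    # "Female" "Male"

# Explicitly keep character vectors, then convert a single field
df2 <- data.frame(age, sex, wt, stringsAsFactors = FALSE)
class(df2$sex)     # "character"
df2$sex <- factor(df2$sex)
str(df2)
```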


2.6.3 Naming data frames 


Table 2.31 Common ways of naming data frames

Function   Try these examples

names      x <- data.frame(var1 = 1:3, var2 = c("M", "F", "F"))
           x
           names(x) <- c("Subjno", "Sex")
           x
row.names  row.names(x) <- c("Subj 1", "Subj 2", "Subj 3")
           x


Everything that applies to naming list components (Table 2.25 on page 80) also
applies to naming data frame components (Table 2.31). In general,
we may be interested in renaming variables (fields) or row names of a data frame,
or renaming the levels (possible values) for a given factor (categorical variable). For
example, consider the Oswego data set.


> odat <- read.table("http://www.medepi.net/data/oswego.txt",
+ sep="", header=TRUE, na.strings=".")
> odat[1:5,1:8] #Display partial data frame
  id age sex meal.time ill onset.date onset.time baked.ham
1  2  52   F   8:00 PM   Y       4/19   12:30 AM         Y
2  3  65   M   6:30 PM   Y       4/19   12:30 AM         Y
3  4  59   F   6:30 PM   Y       4/19   12:30 AM         Y
4  6  63   F   7:30 PM   Y       4/18   10:30 PM         Y
5  7  70   M   7:30 PM   Y       4/18   10:30 PM         Y
> names(odat)[3] <- "Gender" #Rename 'sex' to 'Gender'
> table(odat$Gender) #Display 'Gender' distribution
 F  M
44 31
> levels(odat$Gender) #Display 'Gender' levels
[1] "F" "M"
> #Replace 'Gender' level labels
> levels(odat$Gender) <- c("Female", "Male")
> levels(odat$Gender) #Display new 'Gender' levels
[1] "Female" "Male"
> table(odat$Gender) #Confirm distribution is same
Female   Male
    44     31
> odat[1:5,1:8] #Display partial data frame
  id age Gender meal.time ill onset.date onset.time baked.ham
1  2  52 Female   8:00 PM   Y       4/19   12:30 AM         Y
2  3  65   Male   6:30 PM   Y       4/19   12:30 AM         Y
3  4  59 Female   6:30 PM   Y       4/19   12:30 AM         Y
4  6  63 Female   7:30 PM   Y       4/18   10:30 PM         Y
5  7  70   Male   7:30 PM   Y       4/18   10:30 PM         Y
Table 2.32 Common ways of indexing data frames

Indexing     Try these examples

By position  data(infert)
             infert[1:5, 1:3]
By name      infert[1:5, c("education", "age", "parity")]
By logical   agelt30 <- infert$age<30; agelt30
             infert[agelt30, c("education", "induced", "parity")]
             # can also use 'subset' function
             subset(infert, agelt30,
                    c("education", "induced", "parity"))





On occasion, we might be interested in renaming the row names. Currently, the
Oswego data set has default integer values from 1 to 75 as the row names.


> row.names(odat)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"
[12] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"
[23] "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33"
[34] "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44"
[45] "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55"
[56] "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66"
[67] "67" "68" "69" "70" "71" "72" "73" "74" "75"


We can change the row names by assigning a new character vector. 


> row.names(odat) <- sample(101:199, size=nrow(odat))
> odat[1:5,1:7]
    id age Gender meal.time ill onset.date onset.time
123  2  52 Female   8:00 PM   Y       4/19   12:30 AM
145  3  65   Male   6:30 PM   Y       4/19   12:30 AM
173  4  59 Female   6:30 PM   Y       4/19   12:30 AM
138  6  63 Female   7:30 PM   Y       4/18   10:30 PM
146  7  70   Male   7:30 PM   Y       4/18   10:30 PM





2.6.4 Indexing data frames 


Indexing a data frame is similar to indexing a matrix or a list: we can index by
position, by name, or by logical vector. Consider, for example, the 2004 California
West Nile virus human disease surveillance data. Suppose we are interested in
summarizing the Los Angeles cases with neuroinvasive disease (“WNND”).


> wdat <- read.csv("http://www.medepi.net/data/wnv/wnv2004fin.txt")
> str(wdat)
'data.frame': 779 obs. of 8 variables:
 $ id         : int 1 2 3 4 5 6 7 8 9 10
 $ county     : Factor w/ 23 levels "Butte","Fresno",..: 14
 $ age        : int 40 64 19 12 12 17 61 74 71 26
 $ sex        : Factor w/ 2 levels "F","M": 1 1 2 2 2 2 2 1 ...
 $ syndrome   : Factor w/ 3 levels "Unknown","WNF",..: 2 2 3
 $ date.onset : Factor w/ 130 levels "2004-05-14",..: 3
 $ date.tested: Factor w/ 104 levels "2004-06-02",..: 1 ...
 $ death      : Factor w/ 2 levels "No","Yes": 1 1 1 ...
> levels(wdat$county) #Review levels of 'county' variable
 [1] "Butte"          "Fresno"         "Glenn"
 [4] "Imperial"       "Kern"           "Lake"
 [7] "Lassen"         "Los Angeles"    "Merced"
[10] "Orange"         "Placer"         "Riverside"
[13] "Sacramento"     "San Bernardino" "San Diego"
[16] "San Joaquin"    "Santa Clara"    "Shasta"
[19] "Sn Luis Obispo" "Tehama"         "Tulare"
[22] "Ventura"        "Yolo"
> levels(wdat$syndrome) #Review levels of 'syndrome' variable
[1] "Unknown" "WNF"     "WNND"
> myrows <- wdat$county=="Los Angeles" & wdat$syndrome=="WNND"
> mycols <- c("id", "county", "age", "sex", "syndrome", "death")
> wnv.la <- wdat[myrows, mycols]
> wnv.la
     id      county age sex syndrome death
25   25 Los Angeles  70   M     WNND    No
26   26 Los Angeles  59   M     WNND    No
27   27 Los Angeles  59   M     WNND    No
...
734 736 Los Angeles  71   M     WNND   Yes
770 772 Los Angeles  72   M     WNND    No
776 778 Los Angeles  50   F     WNND    No








In this example, the data frame rows were indexed by logical vector, and the columns 
were indexed by names. We emphasize this method because it only requires appli- 
cation of previously learned principles that always work with R objects. 
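The same pattern (rows by logical vector, columns by name) can be tried without downloading any data. The sketch below uses a small made-up data frame, dat, whose field names mimic the surveillance data; the values are invented for illustration. It also compares the result with the subset function listed in Table 2.32.

```r
# Hypothetical surveillance-like data used only for illustration
dat <- data.frame(
  id       = 1:5,
  county   = c("Los Angeles", "Kern", "Los Angeles", "Yolo", "Los Angeles"),
  syndrome = c("WNND", "WNND", "WNF", "WNF", "WNND"),
  age      = c(70, 59, 34, 50, 71)
)

# Rows by logical vector, columns by name
myrows <- dat$county == "Los Angeles" & dat$syndrome == "WNND"
mycols <- c("id", "age")
sel1 <- dat[myrows, mycols]
sel1

# The subset function selects the same rows and columns
sel2 <- subset(dat, county == "Los Angeles" & syndrome == "WNND",
               select = c(id, age))
all.equal(sel1, sel2)
```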

An alternative method is to use the subset function. The first argument speci- 
fies the data frame, the second argument is a Boolean operation that evaluates to a 
logical vector, and the third argument specifies what variables (or range of variables) 
to include or exclude. 


> wnv.sf2 <- subset(wdat, county=="Los Angeles" & syndrome=="WNND",
+ select = c(id, county, age, sex, syndrome, death))
> wnv.sf2[1:6,]
   id      county age sex syndrome death
25 25 Los Angeles  70   M     WNND    No
26 26 Los Angeles  59   M     WNND    No
27 27 Los Angeles  59   M     WNND    No
47 47 Los Angeles  57   M     WNND    No
48 48 Los Angeles  60   M     WNND    No
49 49 Los Angeles  34   M     WNND    No

Applied Epidemiology Using R 14-Oct-2013 © Tomas J. Aragon (www.medepi.com) 


This example is equivalent but specifies a range of variables using the : function:


> wnv.sf3 <- subset(wdat, county=="Los Angeles" & syndrome=="WNND",
+ select = c(id:syndrome, death))
> wnv.sf3[1:6,]
   id      county age sex syndrome death
25 25 Los Angeles  70   M     WNND    No
26 26 Los Angeles  59   M     WNND    No
27 27 Los Angeles  59   M     WNND    No
47 47 Los Angeles  57   M     WNND    No
48 48 Los Angeles  60   M     WNND    No
49 49 Los Angeles  34   M     WNND    No


This example is equivalent but specifies variables to exclude using the - function:


> wnv.sf4 <- subset(wdat, county=="Los Angeles" & syndrome=="WNND",
+ select = -c(date.onset, date.tested))
> wnv.sf4[1:6,]
   id      county age sex syndrome death
25 25 Los Angeles  70   M     WNND    No
26 26 Los Angeles  59   M     WNND    No
27 27 Los Angeles  59   M     WNND    No
47 47 Los Angeles  57   M     WNND    No
48 48 Los Angeles  60   M     WNND    No
49 49 Los Angeles  34   M     WNND    No


The subset function offers some conveniences such as the ability to specify a
range of fields to include using the : function, and to specify a group of fields to
exclude using the - function.


2.6.5 Replacing data frame components


With data frames, as with all R data objects, anything that can be indexed can be 
replaced. We already saw some examples of replacing names. For practice, study 
and implement the examples in Table 2.33 on the facing page. 
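As a compact illustration of "anything that can be indexed can be replaced," the sketch below replaces values in a small made-up data frame by position, by name, and by logical vector. The data frame df and its values are hypothetical.

```r
# Hypothetical data frame used only for illustration
df <- data.frame(id = 1:5,
                 age = c(25, 31, 44, 29, 52),
                 parity = c(1, 5, 2, 6, 3))

df[1:2, 2] <- c(NA, 45)      # by position: rows 1-2 of column 2
df[3, "age"] <- 40           # by name: row 3 of field "age"

# by logical: set parity values of 5 or 6 to missing (NA)
df$parity[df$parity == 5 | df$parity == 6] <- NA

df
table(df$parity, useNA = "ifany")   # tabulate, showing the NA count
```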
Table 2.33 Common ways of replacing data frame components

Replacing    Try these examples

By position  data(infert)
             infert[1:4, 1:2]
             infert[1:4, 2] <- c(NA, 45, NA, 23)
             infert[1:4, 1:2]
By name      names(infert)
             infert[1:4, c("education", "age")]
             infert[1:4, c("age")] <- c(NA, 45, NA, 23)
             infert[1:4, c("education", "age")]
By logical   table(infert$parity)
             # change values of 5 or 6 to missing (NA)
             infert$parity[infert$parity==5 | infert$parity==6] <- NA
             table(infert$parity)
             table(infert$parity, exclude=NULL)





Table 2.34 Common ways of operating on a data frame

Function   Description                  Try these examples

tapply     Apply a function to          data(infert)
           strata of a vector that      args(tapply)
           are defined by a unique      tapply(infert$age, infert$education,
           combination of the levels           mean, na.rm = TRUE)
           of selected factors
lapply     Apply a function to          lapply(infert[,1:3], table)
           each component of the
           list
sapply     Apply a function to          sapply(infert[,c("age", "parity")],
           each component of a                 mean, na.rm = TRUE)
           list, and simplify
aggregate  Splits the data into         aggregate(infert[,c("age", "parity")],
           subsets, computes              by = list(Education = infert$education,
           summary statistics for           Induced = infert$induced), mean)
           each, and returns the
           result in a convenient
           form.
mapply     Apply a function to          df <- data.frame(var1 = 1:4, var2 = 4:1)
           the first elements of        mapply("*", df$var1, df$var2)
           each argument, the           mapply(c, df$var1, df$var2)
           second elements, the         mapply(c, df$var1, df$var2, SIMPLIFY=F)
           third elements, and so
           on.


2.6.6 Operations on data frames


A data frame is of mode list, and functions that operate on components of a list will 
work with data frames. For example, consider the California population estimates 
and projections for the years 2000-2050. 


> capop <- read.csv("http://www.dof.ca.gov/HTML/DEMOGRAP/Data/RaceEthnic/Population-00-50/documents/California.txt")
> str(capop)
'data.frame': 10302 obs. of 11 variables:
 $ County          : int 59 59 59 59 59 59 59 59 59 59
 $ Year            : int 2000 2000 2000 2000 2000 2000
 $ Sex             : Factor w/ 2 levels "F","M": 1 1 1
 $ Age             : int 0 1 2 3 4 5 6 7 8 9
 $ White           : int 75619 76211 76701 78551 82314
 $ Hispanic        : int 115911 113706 114177 116733
 $ Asian           : int 20879 20424 21044 21920 22760
 $ Pacific.Islander: int 741 765 806 817 884 945 961
 $ Black           : int 14629 15420 15783 16531 17331
 $ American.Indian : int 1022 1149 1169 1318 1344 1363
 $ Multirace       : int 10731 8676 8671 8556 8621
> capop[1:5,1:8]
  County Year Sex Age White Hispanic Asian Pacific.Islander
1     59 2000   F   0 75619   115911 20879              741
2     59 2000   F   1 76211   113706 20424              765
3     59 2000   F   2 76701   114177 21044              806
4     59 2000   F   3 78551   116733 21920              817
5     59 2000   F   4 82314   119995 22760              884


Now, suppose we want to assess the range of the numeric fields. If we treat the data
frame as a list, both lapply and sapply work:


> sapply(capop[-3], range)
     County Year Age  White Hispanic Asian Pacific.Islander
[1,]     59 2000   0    497      110    76                1
[2,]     59 2050 100 148246   277168 46861             1890
     Black American.Indian Multirace
[1,]    57               0         5
[2,] 26983            8181     17493


However, if we treat the data frame as a matrix, apply also works:


> apply(capop[,-3], 2, range)
     County Year Age  White Hispanic Asian Pacific.Islander
[1,]     59 2000   0    497      110    76                1
[2,]     59 2050 100 148246   277168 46861             1890
     Black American.Indian Multirace
[1,]    57               0         5
[2,] 26983            8181     17493


Some R functions, such as summary, will summarize every variable in a data 
frame without having to use lapply or sapply. 


> summary(capop[1:7])
     County        Year      Sex           Age
 Min.   :59   Min.   :2000   F:5151   Min.   :  0
 1st Qu.:59   1st Qu.:2012   M:5151   1st Qu.: 25
 Median :59   Median :2025            Median : 50
 Mean   :59   Mean   :2025            Mean   : 50
 3rd Qu.:59   3rd Qu.:2038            3rd Qu.: 75
 Max.   :59   Max.   :2050            Max.   :100
     White           Hispanic          Asian
 Min.   :   497   Min.   :   110   Min.   :   76
 1st Qu.: 63134   1st Qu.: 29962   1st Qu.:19379
 Median : 75944   Median :115646   Median :30971
 Mean   : 71591   Mean   :101746   Mean   :27769
 3rd Qu.: 88021   3rd Qu.:154119   3rd Qu.:37763
 Max.   :148246   Max.   :277168   Max.   :46861
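The list-style and matrix-style approaches can be compared on a tiny data frame where the answers are easy to check by eye. The data frame df below is made up for illustration.

```r
# Hypothetical numeric data frame used only for illustration
df <- data.frame(a = c(3, 1, 2), b = c(10, 30, 20))

r1 <- sapply(df, range)      # treats df as a list of columns
r2 <- apply(df, 2, range)    # treats df as a matrix, working over columns

r1                           # row 1 = minimum, row 2 = maximum
all(r1 == r2)                # TRUE: both approaches agree
```

One design consideration: apply first coerces the data frame to a matrix, so it is best reserved for data frames whose columns are all of the same type; this is why the factor column Sex was dropped in the examples above.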


2.6.6.1 The aggregate function 


The aggregate function is almost identical to the tapply function. Recall that 
tapply allows us to apply a function to a vector that is stratified by one or more 
fields; for example, calculating mean age (1 field) stratified by sex and ethnicity (2 
fields). In contrast, aggregate allows us to apply a function to a group of fields 
that are stratified by one or more fields; for example, calculating the mean weight 
and height (2 fields) stratified by sex and ethnicity (2 fields): 








> sex <- c("M", "M", "M", "M", "F", "F", "F", "F")
> eth <- c("W", "W", "B", "B", "W", "W", "B", "B")
> wgt <- c(140, 150, 150, 160, 120, 130, 130, 140)
> hgt <- c( 60,  70,  70,  80,  40,  50,  50,  60)
> df <- data.frame(sex, eth, wgt, hgt)
> aggregate(df[, 3:4], by = list(Gender = df$sex,
+ Ethnicity = df$eth), FUN = mean)
  Gender Ethnicity wgt hgt
1      F         B 135  55
2      M         B 155  75
3      F         W 125  45
4      M         W 145  65

For another example, in the capop data frame, we notice that the variable age
goes from 0 to 100 by 1-year intervals. It will be useful to aggregate ethnic-specific
population estimates into larger age categories. More specifically, we want to calculate
the sum of ethnic-specific population estimates (7 fields) stratified by age
category, sex, and year (3 fields). We will create a new 7-level age category field
commonly used by the National Center for Health Statistics. Naturally, we use the
aggregate function:
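Before working with the full population file, the binning step can be sketched on a handful of made-up ages. The vectors age and pop below are invented for illustration; the break points are the seven NCHS-style categories described above, with 101 as an upper bound so that age 100 is included.

```r
# Hypothetical ages and counts used only for illustration
age <- c(0, 3, 14, 15, 44, 65, 100)
pop <- c(10, 20, 30, 40, 50, 60, 70)

breaks7 <- c(0, 1, 5, 15, 25, 45, 65, 101)
agecat <- cut(age, breaks = breaks7, right = FALSE)  # intervals closed on the left
levels(agecat)                 # "[0,1)" "[1,5)" ... "[65,101)"
levels(agecat)[7] <- "65+"     # relabel the open-ended top category

as.integer(agecat)             # 1 2 3 4 5 7 7 (age 15 falls in [15,25), not [5,15))
res <- aggregate(pop, by = list(Age = agecat), FUN = sum)
res                            # the two oldest records are summed under "65+"
```

Setting right = FALSE matters: with the default (right = TRUE) the intervals would be closed on the right, so age 0 would fall outside the first interval and become NA.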


> capop <- read.csv("http://www.dof.ca.gov/HTML/DEMOGRAP/Data/RaceEthnic/Population-00-50/documents/California.txt")
> to.keep <- c("White", "Hispanic", "Asian", "Pacific.Islander",
+ "Black", "American.Indian", "Multirace")
> age.nchs7 <- c(0, 1, 5, 15, 25, 45, 65, 101)
> capop$agecat7 <- cut(capop$Age, breaks = age.nchs7, right=FALSE)
> capop7 <- aggregate(capop[,to.keep], by = list(Age=capop$agecat7,
+ Sex=capop$Sex, Year=capop$Year), FUN = sum)
> levels(capop7$Age)[7] <- "65+"
> capop7[1:14, 1:6]
       Age Sex Year   White Hispanic  Asian
1    [0,1)   F 2000   75619   115911  20879
2    [1,5)   F 2000  313777   464611  86148
3   [5,15)   F 2000  924930  1124573 241047
4  [15,25)   F 2000  868767   946948 272846
5  [25,45)   F 2000 2360250  1742366 667956
6  [45,65)   F 2000 2102090   735062 445039
7      65+   F 2000 1471842   279865 208566
8    [0,1)   M 2000   79680   121585  21965
9    [1,5)   M 2000  331193   484068  91373
10  [5,15)   M 2000  979233  1175384 257574
11 [15,25)   M 2000  925355  1080868 279314
12 [25,45)   M 2000 2465194  1921896 614608
13 [45,65)   M 2000 2074833   687549 384011
14     65+   M 2000 1075226   202299 154966


2.7 Managing data objects


When we work in R we have a workspace. Think of the workspace as our desktop
that contains the “objects” (data and tools) we use to conduct our work. To view the
objects in our workspace use the ls or objects functions (Table 2.35 on the next
page):


> ls()
 [1] "add.to.x"    ...           "age"         "age.nchs7"
 [5] "agecat"      ...           "ar.num"      "capop"
 [9] "capop7"      "dat"         "dat.ordered" "dat.random"
[13] "dat2"        "dat3"        "dat4"        "dd"
Table 2.35 Common ways of managing data objects

Function    Description                  Try these examples

ls          List objects                 ls()
objects                                  objects() #equivalent
rm          Remove object(s)             yy <- 1:5; ls()
remove                                   rm(yy); ls()
                                         # remove all objects in working environment
                                         # Don't do this unless we are really sure
                                         rm(list = ls())
save.image  Saves current                save.image()
            workspace
save        Writes R objects to the      x <- runif(20)
load        specified external file.     y <- list(a = 1, b = TRUE, c = "oops")
            The objects can be           save(x, y, file = "c:/temp/xy.Rdata")
            read back from the file      rm(x,y); x; y
            at a later date using        load(file = "c:/temp/xy.Rdata")
            'load'                       x
                                         y





We use the pattern option to search for object names that contain the pattern we 
specify. 


> ls(pattern = "dat")
 [1] "dat"         "dat.ordered" "dat.random"  "dat2"
 [5] "dat3"        "dat4"        "mydat"       "sdat"
 [9] "sdat.asr"    "sdat3"       "st.dates"    "udat1"
[13] "udat2"       "wdat"


The rm or remove functions will remove workspace objects. 


> rm(dat, dat2, dat3, dat4)
> ls(patt="dat")
 [1] "dat.ordered" "dat.random"  "mydat"       "sdat"
 [5] "sdat.asr"    "sdat3"       "st.dates"    "udat1"
 [9] "udat2"       "wdat"
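To practice pattern-based listing and removal without touching our real workspace, we can work inside a scratch environment. This is a sketch under that assumption; the environment e and the object names in it are made up for illustration.

```r
# A scratch environment, so nothing in the real workspace is removed
e <- new.env()
assign("dat1",  1:3, envir = e)
assign("dat2",  4:6, envir = e)
assign("other", "a", envir = e)

ls(e)                         # "dat1" "dat2" "other"
ls(e, pattern = "^dat")       # names beginning with "dat"

# Remove every object whose name matches the pattern
rm(list = ls(e, pattern = "^dat"), envir = e)
ls(e)                         # "other"
```

The pattern argument is a regular expression, so "dat" matches the string anywhere in a name, whereas "^dat" matches only names that begin with it.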


To remove all data objects use the following code with extreme caution: 
rm(list = ls()) 


However, the object names may not be sufficiently descriptive to know what
these objects contain. To assess R objects in our workspace we use the functions
summarized in Table 2.2 on page 30. In general, we never go wrong using the str,
mode, and class functions.


> mode(capop7)
[1] "list"
> class(capop7)
[1] "data.frame"
> str(capop7)
'data.frame': 714 obs. of 10 variables:
 $ Age             : Factor w/ 7 levels "[0,1)",..: 1 2 3
 $ Sex             : Factor w/ 2 levels "F","M": 1 1 1 1
 $ Year            : Factor w/ 51 levels "2000",..: 1 1 1
 $ White           : int 75619 313777 924930 868767
 $ Hispanic        : int 115911 464611 1124573 946948
 $ Asian           : int 20879 86148 241047 272846
 $ Pacific.Islander: int 741 3272 9741 9629 19085 9898
 $ Black           : int 14629 65065 195533 158923
 $ American.Indian : int 1022 4980 15271 14301 30960
 $ Multirace       : int 10731 34524 78716 56735 82449


> mode(orcalc)
[1] "function"
> class(orcalc)
[1] "function"
> str(orcalc)
function (x)
 - attr(*, "source")= chr [1:5] "function(x) {"
> orcalc
function (x) {
  or <- (x[1,1]*x[2,2])/(x[1,2]*x[2,1])
  pval <- fisher.test(x)$p.value
  list(data = x, odds.ratio = or, p.value = pval)
}


Objects created in the workspace are available during the R session. Upon closing
the R session, R asks whether to save the workspace. To save the objects without
exiting an R session, use the save.image function:


> save.image() 


The save.image function is actually a special case of the save function:


save(list = ls(all = TRUE), file = ".RData")


The save function saves an R object as an external file. This file can be loaded
using the load function.


> x <- 1:5
> x
[1] 1 2 3 4 5
> save(x, file="/home/tja/temp/x")
> rm(x)
> x
Error: object "x" not found
> load(file="/home/tja/temp/x")
> x
[1] 1 2 3 4 5


Table 2.36 Assessing and coercing data objects

Query data type   Coerce to data type

is.vector         as.vector
is.matrix         as.matrix
is.array          as.array
is.list           as.list
is.data.frame     as.data.frame
is.factor         as.factor
is.ordered        as.ordered
is.table          as.table
is.numeric        as.numeric
is.integer        as.integer
is.character      as.character
is.logical        as.logical
is.function       as.function
is.null           as.null
is.na             n/a
is.nan            n/a
is.finite         n/a
is.infinite       n/a


Table 2.36 provides more functions for conducting specific object queries and for 
coercing one object type into another. For example, a vector is not a matrix. 


> is.matrix(1:3) 
[1] FALSE 


However, a vector can be coerced into a matrix. 


> as.matrix(1:3)
     [,1]
[1,]    1
[2,]    2
[3,]    3
> is.matrix(as.matrix(1:3))
[1] TRUE


A common use would be to coerce a factor into a character vector. 
> sex <- factor(c("M", "M", "M", "M", "F", "F", "F", "F"))
> sex
[1] M M M M F F F F
Levels: F M
> unclass(sex) #does not coerce into character vector
[1] 2 2 2 2 1 1 1 1
attr(,"levels")
[1] "F" "M"
> as.character(sex) #yes, works
[1] "M" "M" "M" "M" "F" "F" "F" "F"
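A related situation, where coercion order matters, arises when a factor's labels are themselves numbers. The sketch below uses a made-up factor f; coercing it directly to numeric returns the internal integer codes, not the labels.

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "10", "30"))

as.numeric(f)                  # 1 2 1 3: the internal codes
as.numeric(as.character(f))    # 10 20 10 30: the values we usually want
```

Coercing to character first, and then to numeric, recovers the original values.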


In R, missing values are represented by the value NA (“not available”). The
is.na function evaluates an object and returns a logical vector indicating which
positions contain NA. The !is.na version returns positions that do not contain NA.


> x <- c(12, 34, NA, 56, 89)
> is.na(x)
[1] FALSE FALSE  TRUE FALSE FALSE
> !is.na(x)
[1]  TRUE  TRUE FALSE  TRUE  TRUE























We can use is.na to replace missing values. 


> x[is.na(x)] <- 999
> x
[1]  12  34 999  56  89
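Replacing NA with a numeric code such as 999 has a cost: the code is then treated as a real value in calculations. This short sketch (the object names x.na and x.999 are illustrative) contrasts the two representations and shows how to convert the code back to NA.

```r
# Sketch: a sentinel code distorts summaries; NA can be skipped explicitly.
x.na <- c(12, 34, NA, 56, 89)
m.na <- mean(x.na, na.rm = TRUE)      # mean of the four observed values

x.999 <- x.na
x.999[is.na(x.999)] <- 999            # recode missing as 999
m.999 <- mean(x.999)                  # 999 is now counted as data

x.999[x.999 == 999] <- NA             # convert the code back to NA
m.na
m.999
```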


In R, NaN represents “not a number” and Inf represents an infinite value. Therefore,
we can use is.nan and is.infinite to assess which positions contain
NaN and Inf, respectively.











> x <- c(0, 3, 0, -6)
> y <- c(4, 0, 0, 0)
> z <- x/y
> z
[1]    0  Inf  NaN -Inf
> is.nan(z)
[1] FALSE FALSE  TRUE FALSE
> is.infinite(z)
[1] FALSE  TRUE FALSE  TRUE
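The is.finite function from Table 2.36 covers both cases at once. As a minimal sketch using the same vectors, it can flag every position that is not an ordinary number; the object name z.clean is introduced here for illustration.

```r
# Sketch: flag and recode all non-finite results in one step.
x <- c(0, 3, 0, -6)
y <- c(4, 0, 0, 0)
z <- x / y
ok <- is.finite(z)        # FALSE for Inf, -Inf, NaN (and NA)
z.clean <- z
z.clean[!ok] <- NA        # treat non-finite results as missing
ok
z.clean
```

Note that is.na also returns TRUE for NaN, whereas is.nan returns FALSE for NA.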

















2.8 Managing our workspace 


Our workspace is like a desktop that contains the “objects” (data and tools) we use
to conduct our work. Use the getwd function to display the path of the working
directory, which is where the workspace file .RData is kept.

> getwd()
[1] "/home/tja/Data/R/home"


Use the setwd function to set up a new workspace location. A new .RData file
will be created there when the workspace is saved.


> setwd("/home/tja/Data/R/newproject")


This is one method to manage multiple workspaces for one’s projects. 
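When switching locations inside a script, it helps to remember the previous directory so that we can return to it. A minimal sketch, assuming a throwaway folder under tempdir stands in for a real project directory (the names old.wd and proj are illustrative):

```r
# Sketch: move to a project folder and return afterwards.
old.wd <- getwd()                             # remember where we started
proj <- file.path(tempdir(), "newproject")    # hypothetical project folder
dir.create(proj, showWarnings = FALSE)
setwd(proj)
here <- basename(getwd())                     # "newproject"
setwd(old.wd)                                 # go back to the original location
here
```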
Use the search function to list the packages, environments, or data frames 
attached and available. 


> search() # Linux
[1] ".GlobalEnv"        "package:stats"     "package:graphics"
[4] "package:grDevices" "package:utils"     "package:datasets"
[7] "package:methods"   "Autoloads"         "package:base"


The global environment .GlobalEnv is our workspace. The searchpaths
function lists the full paths:


> searchpaths()
[1] ".GlobalEnv"                  "/usr/lib/R/library/stats"
[3] "/usr/lib/R/library/graphics" "/usr/lib/R/library/grDevices"
[5] "/usr/lib/R/library/utils"    "/usr/lib/R/library/datasets"
[7] "/usr/lib/R/library/methods"  "Autoloads"
[9] "/usr/lib/R/library/base"
Table 2.37 Risk of Death in a 20-year Period Among Women in Whickham, England, According
to Smoking Status at the Beginning of the Period

                  Smoking
Vital Status     Yes     No
Dead             139    230
Alive            443    502


Table 2.38 Risk of Death in a 20-year Period Among Women in Whickham, England, According
to Smoking Status at the Beginning of the Period

                  Smoking
Vital Status     Yes     No    Total
Dead             139    230      369
Alive            443    502      945
Total            582    732     1314


Table 2.39 Risk Ratio and Odds Ratio of Death in a 20-year Period Among Women in Whickham,
England, According to Smoking Status at the Beginning of the Period

                  Smoking
                 Yes     No
Risk            0.24   0.31
Risk Ratio      0.76   1.00
Odds            0.31   0.46
Odds Ratio      0.68   1.00


Problems 


2.1. Download and install RStudio from http://www.rstudio.org. Start 
a new project. For example, I started a new project in the following directory: 
/home/tja/Documents/courses/ph251d/Rproj. Inside the Rproj directory I re- 
named the XXX.Rproj file to ph251d.Rproj. From the main menu select File > 
New -> R Script. Save the R script as ph251d.R in the Rproj directory. Write and 
run R code from this script. Test R code from the tables in this chapter. 


2.2. Name and describe the six types of R objects.
2.3. How many ways can we index an object? 


2.4. Finish the sentence: Any R object component(s) that can be indexed, can be ________.


2.5. Recreate Table 2.37 using any combination of the matrix, cbind, rbind, 
dimnames, or names functions. 


2.6. Starting with the 2 x 2 matrix object we created previously, using only the 
apply, cbind, rbind, names, and dimnames functions, recreate Table 2.38. 
2.7. Using the 2 x 2 data from Table 2.37, use the sweep and
apply functions to calculate marginal and joint distributions.


2.8. Using the data from the previous problems, recreate Table 2.39
and interpret the results.


2.9. Read in the Whickham, England data using the R code below. 


wdat = read.table("http://www.medepi.net/data/whickham-engl.txt",
    sep = ",", header = TRUE)
str(wdat)
xtabs(~Vital.Status + Age + Smoking, data = wdat)


Stratified by age category, calculate the risk of death comparing smokers to
nonsmokers. Show your results. What is your interpretation?


2.10. Use the read.table function to read in the syphilis data available at
http://www.medepi.net/data/syphilis89c.txt. Evaluate the structure of the data
frame. Do not attach the std data frame (yet). Create a 3-dimensional array using both
the table and xtabs functions. Now attach the std data frame using the attach
function. Create the same 3-dimensional array using both the table and xtabs
functions.


2.11. Use the apply function to get marginal totals for the syphilis 3-dimensional 
array. 


2.12. Use the sweep and apply functions to get marginal and joint distributions 
for a 3-D array. 


2.13. Review and read in the group-level, tabular data set of primary and secondary
syphilis cases in the United States in 1989 available at
http://www.medepi.net/data/syphilis89b.txt. Use the rep function on the data frame fields
to recreate the individual-level data frame with over 40,000 observations.


2.14. Working with population estimates can be challenging because of the amount 
of data manipulation. Study the 2000 population estimates for California Counties: 
http://www.medepi.net/data/calpop/CalCounties2000.txt. Now, 
study and implement this R code. For each expression or group of expressions, ex- 
plain in words what the R code is doing. Be sure to display intermediate objects to 
understand each step. 


#1 Read county names
cty <- scan("http://www.medepi.net/data/calpop/calcounty.txt",
    what="")

#2 Read county population estimates
calpop =
  read.csv("http://www.medepi.net/data/calpop/CalCounties2000.txt",
    header = T)

#2 Replace county number with county name
for(i in 1:length(cty)){
  calpop$County[calpop$County==i] <- cty[i]
}

#3 Discretize age into categories
calpop$Agecat <- cut(calpop$Age, c(0,20,45,65,100),
    include.lowest = TRUE, right = FALSE)

#4 Create combined API category
calpop$AsianPI <- calpop$Asian + calpop$Pacific.Islander

#5 Shorten selected ethnic labels
names(calpop)[c(6, 9, 10)] = c("Latino", "AfrAmer", "AmerInd")

#6 Index Bay Area Counties
baindex <- calpop$County=="Alameda" | calpop$County=="San Francisco"
bapop <- calpop[baindex,]
bapop

#7 Labels for later use
agelabs <- names(table(bapop$Agecat))
sexlabs <- c("Female", "Male")
racen <- c("White", "AfrAmer", "AsianPI", "Latino", "Multirace",
    "AmerInd")
ctylabs <- names(table(bapop$County))

#8 Aggregate
bapop2 <- aggregate(bapop[,racen],
    list(Agecat = bapop$Agecat, Sex = bapop$Sex,
    County = bapop$County), sum)
bapop2

#9 Temp matrix of counts
tmp <- as.matrix(cbind(bapop2[1:4,racen], bapop2[5:8,racen],
    bapop2[9:12,racen], bapop2[13:16,racen]))

#10 Convert into final array
bapop3 <- array(tmp, c(4, 6, 2, 2))
dimnames(bapop3) <- list(agelabs, racen, sexlabs, ctylabs)
bapop3
Chapter 3

Managing epidemiologic data in R





3.1 Entering and importing data 


There are many ways of getting our data into R for analysis. In the section that
follows we review how to enter the University Group Diabetes Program data (Table 3.1)
as well as the original data from a comma-delimited text file. We will use
the following approaches:


• Entering data at the command prompt
• Importing data from a file
• Importing data using a URL


3.1.1 Entering data at the command prompt 


We review four methods. For Methods 1 and 2, data are entered directly at the command
prompt. Method 3 uses the same R expressions and data as Methods 1 and
2, but they are entered into a text editor, saved as a text file with a .R extension
(e.g., job02.R), and then executed from the command prompt using the source
function. Alternatively, the R expressions and data can be copied and pasted into
R.¹ And, for Method 4 we use R’s spreadsheet editor (least preferred).


Table 3.1 Deaths among subjects who received tolbutamide and placebo in the University Group
Diabetes Program (1970), stratifying by age

                  Age<55                 Age≥55                Combined
            Tolbutamide Placebo    Tolbutamide Placebo    Tolbutamide Placebo
Deaths            8        5            22       16            30       21
Survivors        98      115            76       69           174      184


3.1.1.1 Method 1 


For review, a convenient way to enter data at the command prompt is to use the c 
function: 


> #enter data for a vector
> vec1 <- c(8, 98, 5, 115); vec1
[1]   8  98   5 115
> vec2 <- c(22, 76, 16, 69); vec2
[1] 22 76 16 69
>
> #enter data for a matrix
> mtx1 <- matrix(vec1, 2, 2); mtx1
     [,1] [,2]
[1,]    8    5
[2,]   98  115
> mtx2 <- matrix(vec2, 2, 2); mtx2
     [,1] [,2]
[1,]   22   16
[2,]   76   69
>
> #enter data for an array and sum across strata
> udat <- array(c(vec1, vec2), c(2, 2, 2)); udat
, , 1

     [,1] [,2]
[1,]    8    5
[2,]   98  115

, , 2

     [,1] [,2]
[1,]   22   16
[2,]   76   69

> udat.tot <- apply(udat, c(1, 2), sum); udat.tot
     [,1] [,2]
[1,]   30   21
[2,]  174  184
>
> #enter a list
> x <- list(crude.data = udat.tot, stratified.data = udat)
> x$crude.data
     [,1] [,2]
[1,]   30   21
[2,]  174  184
> x$stratified
, , 1

     [,1] [,2]
[1,]    8    5
[2,]   98  115

, , 2

     [,1] [,2]
[1,]   22   16
[2,]   76   69

>
> #enter simple data frame
> subjname <- c("Pedro", "Paulo", "Maria")
> subjno <- 1:length(subjname)
> age <- c(34, 56, 56)
> sex <- c("Male", "Male", "Female")
> dat <- data.frame(subjno, subjname, age, sex); dat
  subjno subjname age    sex
1      1    Pedro  34   Male
2      2    Paulo  56   Male
3      3    Maria  56 Female
>
> #enter a simple function
> odds.ratio <- function(aa, bb, cc, dd){ aa*dd / (bb*cc) }
> odds.ratio(30, 174, 21, 184)
[1] 1.510673


¹ In RStudio, select from main menu Code > Run Region, etc.

Applied Epidemiology Using R 14-Oct-2013 © Tomas J. Aragon (www.medepi.com)


3.1.1.2 Method 2


Method 2 is identical to Method 1 except one uses the scan function. It does not
matter if we enter the numbers on different lines, it will still be a vector. Remember
that we must press the Enter key twice after we have entered the last number.


> udat.tot <- scan()
1: 30 174
3: 21 184
5:
Read 4 items
> udat.tot
[1]  30 174  21 184


To read in a matrix at the command prompt combine the matrix and scan
functions. Again, it does not matter on what lines we enter the data, as long as they
are in the correct order because the matrix function reads data in column-wise.


> udat.tot <- matrix(scan(), 2, 2)
1: 30 174 21 184
5:
Read 4 items
> udat.tot
     [,1] [,2]
[1,]   30   21
[2,]  174  184


> udat.tot <- matrix(scan(), 2, 2, byrow = T) #read row-wise
1: 30 21 174 184
5:
Read 4 items
> udat.tot
     [,1] [,2]
[1,]   30   21
[2,]  174  184
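The effect of the byrow argument can be checked without interactive input. In this minimal sketch the numbers are typed directly with the c function instead of scan, and the object names by.col and by.row are illustrative.

```r
# Sketch: the same 2 x 2 table entered column-wise and row-wise.
by.col <- matrix(c(30, 174, 21, 184), nrow = 2, ncol = 2)                # fills columns first
by.row <- matrix(c(30, 21, 174, 184), nrow = 2, ncol = 2, byrow = TRUE)  # fills rows first
same <- identical(by.col, by.row)
by.col
same
```

Entering data row-wise is often more natural when copying from a printed table, since we read the table left to right.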


To read in an array at the command prompt combine the array and scan func- 
tions. Again, it does not matter on what lines we enter the data, as long as they are 
in the correct order because the array function reads the numbers column-wise. In 
this example we include the dimnames argument. 


> udat <- array(scan(), dim = c(2, 2, 2),
+   dimnames = list(Vital.Status = c("Dead","Survived"),
+     Treatment = c("Tolbutamide", "Placebo"),
+     Age.Group = c("<55", "55+")))
1: 8 98 5 115 22 76 16 69
9:
Read 8 items
> udat
, , Age.Group = <55

            Treatment
Vital.Status Tolbutamide Placebo
    Dead               8       5
    Survived          98     115

, , Age.Group = 55+

            Treatment
Vital.Status Tolbutamide Placebo
    Dead              22      16
    Survived          76      69
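Once the data are in a labeled array, stratum-specific measures follow from a single call to apply. The sketch below is one possible approach, not the only one: it redefines the odds ratio as a function of a 2 x 2 matrix (a variation on the four-argument version from Method 1) and applies it over the third dimension. The names or.by.age and or.crude are illustrative.

```r
# Sketch: stratum-specific and crude odds ratios from the 2 x 2 x 2 array.
udat <- array(c(8, 98, 5, 115, 22, 76, 16, 69), dim = c(2, 2, 2),
  dimnames = list(Vital.Status = c("Dead", "Survived"),
                  Treatment = c("Tolbutamide", "Placebo"),
                  Age.Group = c("<55", "55+")))

# odds ratio from a 2 x 2 matrix: cross-product ratio
odds.ratio <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])

or.by.age <- apply(udat, 3, odds.ratio)             # one odds ratio per age stratum
or.crude <- odds.ratio(apply(udat, c(1, 2), sum))   # collapse over age, then compute
or.by.age
or.crude
```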


To read in a list of vectors of the same length (“fields”) at the command prompt
combine the list and scan functions. We will need to specify the type of data that
will go into each “bin” or “field.” This is done by specifying the what argument as
a list. This list must be values that are either logical, integer, numeric, or character.
For example, for a character vector we can use any expression, say x, that would
evaluate to TRUE for is.character(x). For brevity, use "" for character, 0 for
numeric, 1:2 for integer, and T or F for logical. Look at this example:





> dat <- scan("", what = list(1:2, "", 0, "", T))
1: 3 "John Paul" 84.5 Male F
2: 4 "Jane Doe" 34.5 Female T
3:
Read 2 records
> dat
[[1]]
[1] 3 4

[[2]]
[1] "John Paul" "Jane Doe"

[[3]]
[1] 84.5 34.5

[[4]]
[1] "Male"   "Female"

[[5]]
[1] FALSE  TRUE

> str(dat)
List of 5
 $ : int [1:2] 3 4
 $ : chr [1:2] "John Paul" "Jane Doe"
 $ : num [1:2] 84.5 34.5
 $ : chr [1:2] "Male" "Female"
 $ : logi [1:2] FALSE TRUE
> #same example with field names
> dat <- scan("", what = list(id = 1:2, name = "", age = 0,
+   sex = "", dead = TRUE))
1: 3 "John Paul" 84.5 Male F
2: 4 "Jane Doe" 34.5 Female T
3:
Read 2 records
> dat
$id
[1] 3 4

$name
[1] "John Paul" "Jane Doe"

$age
[1] 84.5 34.5

$sex
[1] "Male"   "Female"

$dead
[1] FALSE  TRUE

> str(dat)
List of 5
 $ id  : int [1:2] 3 4
 $ name: chr [1:2] "John Paul" "Jane Doe"
 $ age : num [1:2] 84.5 34.5
 $ sex : chr [1:2] "Male" "Female"
 $ dead: logi [1:2] FALSE TRUE





To read in a data frame at the command prompt combine the data.frame,
scan, and list functions.


> dat <- data.frame(scan("", what = list(id=1:2, name="",
+   age=0, sex="", dead=T)))
1: 3 "John Paul" 84.5 Male F
2: 4 "Jane Doe" 34.5 Female T
3:
Read 2 records
> dat
  id      name  age    sex  dead
1  3 John Paul 84.5   Male FALSE
2  4  Jane Doe 34.5 Female  TRUE
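The same records can be read without typing at the prompt, which makes the step reproducible inside a script. This is a minimal sketch, assuming the text argument of scan is used in place of keyboard input; the object name records is illustrative.

```r
# Sketch: read the two records from a character string instead of the keyboard.
records <- paste('3 "John Paul" 84.5 Male F',
                 '4 "Jane Doe" 34.5 Female T', sep = "\n")
dat <- data.frame(scan(text = records,
                       what = list(id = 1:2, name = "", age = 0,
                                   sex = "", dead = TRUE),
                       quiet = TRUE))   # quiet suppresses the "Read 2 records" message
dat
str(dat)
```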
3.1.1.3 Method 3


Method 3 uses the same R expressions and data as Methods 1 and 2, but they are
entered into a text editor, saved as an ASCII text file with a .R extension (e.g.,
job01.R), and then executed from the command prompt using the source function.
Alternatively, the R expressions and data can be copied and pasted into R.²

For example, the following expressions are in a text editor and saved to a file
named job01.R.


x <- 1:10 
x 


One can copy and paste this code into R at the command prompt.


> x <- 1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10


However, if we execute the code using the source function, it will only display to 
the screen those objects that are printed using the show or print function. Here 
is the text editor code again, but including show. 


x <- 1:10 
show (x) 


Now, source job01.R using source at the command prompt. 


> source("/home/tja/Documents/Rproj/job01.R")
 [1]  1  2  3  4  5  6  7  8  9 10


In general, we highly recommend using a text editor for all our work. The pro- 
gram file (e.g., job01.R) created with the text editor facilitates documenting our 
code, reviewing our code, debugging our code, replicating our analytic steps, and 
auditing by external reviewers. 
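The difference between sourcing a script silently and echoing it can be seen with a small self-contained test. The following is only a sketch: it writes a three-line script to a temporary file (a stand-in for job01.R), evaluates it in a separate environment so the current workspace is not changed, and captures what each call would display. The names script, env, out.plain, and out.echo are illustrative.

```r
# Sketch: what source displays with and without echo = TRUE.
script <- tempfile(fileext = ".R")              # stand-in for job01.R (assumption)
writeLines(c("x <- 1:10", "x", "print(x)"), script)

env <- new.env()                                # keep the sourced objects separate
out.plain <- capture.output(source(script, local = env))
out.echo  <- capture.output(source(script, local = env, echo = TRUE))

out.plain   # only the explicit print(x) produces output
out.echo    # expressions and their results are both displayed
```

Using echo = TRUE is a convenient way to produce a log that shows each expression together with its result.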


3.1.1.4 Method 4 (optional read) 


Method 4 uses R’s spreadsheet editor.³ This is not a preferred method because we
like the original data to be in a text editor or read in from a data file. We will be
using the data.entry and edit functions. The data.entry function allows
editing of an existing object and automatically saves the changes to the original
object name. In contrast, the edit function allows editing of an existing object but
it will not save the changes to the original object name; we must explicitly assign it
to an object name (even if it is the original name).

To enter a vector we need to initialize a vector and then use the data.entry
function (Figure 3.1).


² In RStudio, select from main menu Code > Run Region, etc.
³ Editing of matrix and data frame objects is not currently supported in RStudio.

Fig. 3.1 Select Help from the main menu.


> x <- numeric(10) #Initialize vector with zeros
> x
 [1] 0 0 0 0 0 0 0 0 0 0
> data.entry(x) #Enter numbers, then close window
> x
 [1]  1  2  3  4  5  6  7  8  9 10


However, the edit function applied to a vector does not open a spreadsheet. Try 
the edit function and see what happens. 


xnew <- edit (numeric(10)) #Edit number, then close window 


To enter data into a spreadsheet matrix, first initialize a matrix and then use 
the data.entry or edit function. Notice that the editor added default column 
names. However, to add our own column names just click on the column heading 
with our mouse pointer (unfortunately we cannot give row names). 


> xnew <- matrix(numeric(4),2,2)
> data.entry(xnew)
> xnew <- edit(xnew) #equivalent
>
> #open spreadsheet editor in one step
> xnew <- edit(matrix(numeric(4),2,2))
> xnew
     col1 col2
[1,]   11   33
[2,]   22   44
Arrays and nontabular lists cannot be entered using a spreadsheet editor. Hence,
we begin to see the limitations of a spreadsheet-type approach to data entry. One type
of list, the data frame, can be entered using the edit function.

To enter a data frame use the edit function. However, we do not need to ini- 
tialize a data frame (unlike with a matrix). Again, click on the column headings to 
enter column names. 


> df <- edit(data.frame()) #Spreadsheet screen not shown
> df
    mykids age
1 Tomasito   7
2  Luisito   6
3 Angelita   5


When using the edit function to create a new data frame we must assign it an 
object name to save the data frame. Later we will see that when we edit an existing 
data object we can use the edit or fix function. The fix function differs in that 
fix (data_object) saves our edits directly back to data_object without the 
need to make a new assignment. 


mypower <- function(x, n) {x^n}
fix(mypower) # Edits saved to 'mypower' object
mypower <- edit(mypower) #equivalent


3.1.2 Importing data from a file 


3.1.2.1 Reading an ASCII text data file 


In this section we review how to read the following types of text data files: 








• Comma-separated values (csv) data file (± headers and ± row names)
• Fixed width formatted data file (± headers and ± row names)








Here is the University Group Diabetes Program randomized clinical trial text
data file that is comma-delimited, and includes row names and a header (ugdp.txt).⁴
The header is the first line that contains the column (field) names. The row names
are in the first column, which starts on the second line and uniquely identifies each row.
Notice that the row names do not have a column name associated with them. A data
file can come with either row names or header, neither, or both. Our preference is
to work with data files that have a header and data values that are self-explanatory.
Even without a data dictionary one can still make sense out of this data set.


Status,Treatment,Agegrp
1,Dead,Tolbutamide,<55
2,Dead,Tolbutamide,<55
...
408,Survived,Placebo,55+
409,Survived,Placebo,55+


⁴ Available at http://www.medepi.net/data/ugdp.txt





Notice that the header row has 3 items, and the second row has 4 items. This is
because the row names start in the second row and have no column name. This data
file can be read in using the read.table function, and R figures out that the first
column contains the row names.⁵


> ud <- read.table("http://www.medepi.net/data/ugdp.txt",
+   header = TRUE, sep = ",")
> head(ud) #displays 1st 6 lines
  Status   Treatment Agegrp
1   Dead Tolbutamide    <55
2   Dead Tolbutamide    <55
3   Dead Tolbutamide    <55
4   Dead Tolbutamide    <55
5   Dead Tolbutamide    <55
6   Dead Tolbutamide    <55
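The row-name rule can be verified offline with a few lines of text. In this minimal sketch, a four-record excerpt in the same layout as ugdp.txt is passed through the text argument of read.table; the object names txt and ud.demo are illustrative, and the excerpt is abridged rather than the full file.

```r
# Sketch: header has 3 fields, data rows have 4, so column 1 becomes row names.
txt <- paste("Status,Treatment,Agegrp",
             "1,Dead,Tolbutamide,<55",
             "2,Dead,Tolbutamide,<55",
             "408,Survived,Placebo,55+",
             "409,Survived,Placebo,55+", sep = "\n")
ud.demo <- read.table(text = txt, header = TRUE, sep = ",")
rownames(ud.demo)   # taken from the unnamed first column
names(ud.demo)      # the three header fields
```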








Here is the same data file as it would appear without row names and without a 
header (ugdp2.txt). 


Dead,Tolbutamide,<55
Dead,Tolbutamide,<55
...
Survived,Placebo,55+
Survived,Placebo,55+





This data file can be read in using the read.table function. By default, it adds
row names (1, 2, 3,...).


> cnames <- c("Status", "Treatment", "Agegrp")
> udat2 <- read.table("http://www.medepi.net/data/ugdp2.txt",
+   header = FALSE, sep = ",", col.names = cnames)
> head(udat2)
  Status   Treatment Agegrp
1   Dead Tolbutamide    <55
2   Dead Tolbutamide    <55
3   Dead Tolbutamide    <55
4   Dead Tolbutamide    <55
5   Dead Tolbutamide    <55
6   Dead Tolbutamide    <55








⁵ If row names are supplied, they must be unique.


Here is the same data file as it might appear as a fixed width formatted file. In this file,
columns 1 to 8 are for field #1, columns 9 to 19 are for field #2, and columns 20 to 22
are for field #3. This type of data file is more compact. One needs a data dictionary
to know which columns contain which fields.


Dead    Tolbutamide<55
Dead    Tolbutamide<55
...
SurvivedPlacebo    55+
SurvivedPlacebo    55+





This data file would be read in using the read.fwf function. Because the field
widths are fixed, we must strip the white space using the strip.white option.


> cnames <- c("Status", "Treatment", "Agegrp")
> udat3 <- read.fwf("http://www.medepi.net/data/ugdp3.txt",
+   width = c(8, 11, 3), col.names = cnames, strip.white = TRUE)
> head(udat3)
  Status   Treatment Agegrp
1   Dead Tolbutamide    <55
2   Dead Tolbutamide    <55
3   Dead Tolbutamide    <55
4   Dead Tolbutamide    <55
5   Dead Tolbutamide    <55
6   Dead Tolbutamide    <55
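To see what strip.white does, we can build a two-line fixed width file ourselves. This is a minimal sketch under the column layout described above (widths 8, 11, and 3); the temporary file and the names fwf.file and fw.demo are illustrative stand-ins, not the actual ugdp3.txt file.

```r
# Sketch: read a small fixed width file written to a temporary location.
fwf.file <- tempfile(fileext = ".txt")          # stand-in for ugdp3.txt (assumption)
writeLines(c("Dead    Tolbutamide<55",          # cols 1-8, 9-19, 20-22
             "SurvivedPlacebo    55+"), fwf.file)
fw.demo <- read.fwf(fwf.file, widths = c(8, 11, 3),
                    col.names = c("Status", "Treatment", "Agegrp"),
                    strip.white = TRUE)         # drop the padding blanks
fw.demo
```

Without strip.white = TRUE the padded values (for example, "Dead" followed by four blanks) would be kept as is, and later comparisons such as Status == "Dead" would fail.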








Finally, here is the same data file as it might appear as a fixed width formatted
file but with numeric codes (ugdp4.txt). In this file, column 1 is for field #1, column
2 is for field #2, and column 3 is for field #3. This type of text data file is the most
compact; however, one needs a data dictionary to make sense of all the 1s and 2s.


121
121
...
212
212


Here is how this data file would be read in using the read.fwf function.


> cnames <- c("Status", "Treatment", "Agegrp")
> udat4 <- read.fwf("http://www.medepi.net/data/ugdp4.txt",
+   width = c(1, 1, 1), col.names = cnames)
> head(udat4)
  Status Treatment Agegrp
1      1         2      1
2      1         2      1
3      1         2      1
4      1         2      1
5      1         2      1
6      1         2      1
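Numeric codes become readable once a data dictionary is applied with the factor function. The sketch below assumes a dictionary inferred by comparing the coded rows with the labeled versions of the file shown earlier (Status: 1 = Dead, 2 = Survived; Treatment: 1 = Placebo, 2 = Tolbutamide; Agegrp: 1 = <55, 2 = 55+). That mapping is an assumption, not a documented codebook, and the four coded rows are typed in by hand.

```r
# Sketch: apply an assumed data dictionary to numeric codes.
coded <- data.frame(Status    = c(1, 1, 2, 2),
                    Treatment = c(2, 2, 1, 1),
                    Agegrp    = c(1, 1, 2, 2))
decoded <- data.frame(
  Status    = factor(coded$Status, levels = 1:2, labels = c("Dead", "Survived")),
  Treatment = factor(coded$Treatment, levels = 1:2, labels = c("Placebo", "Tolbutamide")),
  Agegrp    = factor(coded$Agegrp, levels = 1:2, labels = c("<55", "55+")))
decoded
```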


R has other functions for reading text data files (read.csv, read.csv2, 
read.delim, read.delim2). In general, read.table is the function used 
most commonly for reading in data files. 
3.1.2.2 Reading data from a binary format (e.g., Stata, Epi Info)


To read data that come in a binary or proprietary format, load the foreign package
using the library function. To review the functions available in the foreign
package try help(package = foreign). For example, here we read in the
‘infert’ data set which is also available as a Stata data file.⁶


> idat <- read.dta("c:/.../data/infert.dta")
> head(idat)[,1:8]
  id education age parity induced case spontaneous stratum
1  1         0  26      6       1    1           2       1
2  2         0  42      1       1    1           0       2
3  3         0  39      6       2    1           0       3
4  4         0  34      4       2    1           0       4
5  5         1  35      3       1    1           1       5
6  6         1  36      4       2    1           1       6


3.1.3 Importing data using a URL 


As we have already seen, text data files can be read directly off a web server into R 
using the read.table function. Here we load the Western Collaborative Group 
Study data directly off a web server. 


> wdat <- read.table("http://www.medepi.net/data/wcgs.txt",
+   header = TRUE, sep = ",")
> str(wdat)
'data.frame':   3154 obs. of  14 variables:
 $ id     : int  2001 2002 2003 2004 2005 2006 2007 2010 ...
 $ age0   : int  49 42 42 41 59 44 44 40 43 42 ...
 $ height0: int  73 70 69 68 70 72 72 71 72 70 ...
 $ weight0: int  150 160 160 152 150 204 164 150 190 175 ...
 $ sbp0   : int  110 154 110 124 144 150 130 138 146 132 ...
 $ dbp0   : int  76 84 78 78 86 90 84 60 76 90 ...
 $ chol0  : int  225 177 181 132 255 182 155 140 149 325 ...
 $ behpat0: int  2 2 3 4 3 4 4 2 3 2 ...
 $ ncigs0 : int  25 20 0 20 20 0 0 0 25 0 ...
 $ dibpat0: int  1 1 0 0 0 0 0 1 0 1 ...
 $ chd69  : int  0 0 0 0 1 0 0 0 0 0 ...
 $ typechd: int  0 0 0 0 1 0 0 0 0 0 ...
 $ time169: int  1664 3071 3071 3064 1885 3102 3074 1032 ...
 $ arcus0 : int  0 1 0 0 1 0 0 0 0 1 ...


⁶ Available at http://www.medepi.net/data/infert.dta
[Screenshot of the file wnv2004raw.txt open in a text editor. The header line reads
"id","county","age","sex","syndrome","date.onset","date.tested","death" and each
subsequent line is one comma-delimited case record.]

Fig. 3.2 Editing West Nile virus human surveillance data in text editor. Source: California
Department of Health Services, 2004


3.2 Editing data 


In the ideal setting, our data have already been checked, errors have been corrected,
and the data are ready to be analyzed. Post-collection data editing can be minimized by
good design and data collection. However, we may still need to make corrections or
changes in data values.


3.2.1 Text editor 


For small data sets, it may be convenient to edit the data in our favorite text editor.
Key-recording macros, and search and replace tools can be very useful and efficient.
Figure 3.2 displays West Nile virus (WNV) infection surveillance data.⁷ This file is
a comma-delimited data file with a header.


3.2.2 The data.entry, edit, or fix functions


For vectors and matrices we can use the data.entry function to edit the elements of
these data objects. For data frames and functions use the edit or fix functions.
Remember that changes made with the edit function are not saved unless we assign the
result to the original or a new object name. In contrast, changes made with the fix
function are saved back to the original data object name. Therefore, be careful when
using the fix function because we may unintentionally overwrite data.


⁷ Raw data set available at http://www.medepi.net/data/wnv2004raw.txt, and
clean data set available at http://www.medepi.net/data/wnv2004fin.txt


[Screenshot of the R spreadsheet editor displaying the WNV surveillance data frame.]

Fig. 3.3 Using the fix function to edit the WNV surveillance data frame. Unfortunately, this
approach does not facilitate documenting our edits. Source: California Department of Health
Services, 2004

Now let’s read in the WNV surveillance raw data as a data frame. Then, using the
fix function, we will edit the first three records where the value for the syndrome
variable is “Unknown” and change it to NA for missing (Figure 3.3). We will also change
“.” to NA.


> wd <- read.table("http://www.medepi.net/data/wnv/wnv2004raw.txt",
+   header = TRUE, sep = ",", as.is = TRUE)
> wd[wd$syndrome=="Unknown",][1:3,] #before edits (3 records)
     id      county age sex syndrome date.onset date.tested death
128 128 Los Angeles  81   M  Unknown 07/28/2004  08/11/2004     .
129 129   Riverside  44   F  Unknown 07/25/2004  08/11/2004     .
133 133 Los Angeles  36   M  Unknown 08/04/2004  08/11/2004    No
> fix(wd) #open R spreadsheet and make edits (see figure)
> wd[c(128, 129, 133),] #after edits (3 records)
     id      county age sex syndrome date.onset date.tested death
128 128 Los Angeles  81   M       NA 07/28/2004  08/11/2004    NA
129 129   Riverside  44   F       NA 07/25/2004  08/11/2004    NA
133 133 Los Angeles  36   M       NA 08/04/2004  08/11/2004    No











First, notice that in the read.table function we set as.is=TRUE. This means the
data is read in without R making any changes to it. In other words, character vectors
are not automatically converted to factors. We set the option because we knew we
were going to edit and make corrections to the data set, and create factors later. In
this example, I manually started changing the missing values “Unknown” to NA (R’s
representation of missing values). However, this manual approach would be very
inefficient. A better approach is to specify which values in the data frame should
be converted to NA. In the read.table function we should have set the option
na.strings=c("Unknown", "."), converting the character strings “Unknown”
and “.” into NA. Let’s replace the missing values with NAs upon reading
the data file.


> wd <- read.table("http://www.medepi.net/data/wnv/wnv2004raw.txt",
+   header = TRUE, sep = ",", as.is = TRUE,
+   na.strings=c("Unknown", "."))
> wd[c(128, 129, 133),] #verify change
     id      county age sex syndrome date.onset date.tested death
128 128 Los Angeles  81   M     <NA> 07/28/2004  08/11/2004  <NA>
129 129   Riverside  44   F     <NA> 07/25/2004  08/11/2004  <NA>
133 133 Los Angeles  36   M     <NA> 08/04/2004  08/11/2004    No
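The effect of na.strings can be tested on a handful of records before applying it to the full file. The sketch below uses an abridged, partly made-up excerpt (only four of the eight fields, and a hypothetical fourth record) passed through the text argument of read.table; the object names raw and wd.demo are illustrative.

```r
# Sketch: convert "Unknown" and "." to NA while reading (toy excerpt).
raw <- paste("id,county,syndrome,death",
             "128,Los Angeles,Unknown,.",
             "129,Riverside,Unknown,.",
             "133,Los Angeles,Unknown,No",
             "134,Kern,WNF,No",              # hypothetical record for contrast
             sep = "\n")
wd.demo <- read.table(text = raw, header = TRUE, sep = ",", as.is = TRUE,
                      na.strings = c("Unknown", "."))
n.missing <- colSums(is.na(wd.demo))         # number of NAs in each field
wd.demo
n.missing
```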








3.2.3 Vectorized approach 


How do we make these and other changes after the data set has been read into R? 
Although using R’s spreadsheet function is convenient, we do not recommend it 
because manual editing is inefficient, our work cannot be replicated and audited, 
and documentation is poor. Instead use R’s vectorized approach. Let’s look at the 
distribution of responses for each variable to assess what needs to be “cleaned up,” 
in addition to converting missing values to NA. 


> wd <- read.table("http://www.medepi.net/data/wnv/wnv2004raw.txt",
+   header = TRUE, sep = ",", as.is = TRUE)
> str(wd)
'data.frame':   779 obs. of  8 variables:
 $ id         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ county     : chr  "San Bernardino" "San Bernardino" ...
 $ age        : chr  "40" "64" "19" "12" ...
 $ sex        : chr  "F" "F" "M" "M" ...
 $ syndrome   : chr  "WNF" "WNF" "WNF" "WNF" ...
 $ date.onset : chr  "05/19/2004" "05/22/2004" ...
 $ date.tested: chr  "06/02/2004" "06/16/2004" ...
 $ death      : chr  "No" "No" "No" "No" ...


> lapply(wd, table) #apply 'table' function to fields
$id

...
768 769 770 771 772 773 774 775 776 777 778 779 780 781
  1   1   1   1   1   1   1   1   1   1   1   1   1   1

$county

         Butte         Fresno          Glenn       Imperial
             4              ?              3              1
          Kern           Lake         Lassen    Los Angeles
            59              1              1            306
        Merced         Orange         Placer      Riverside
             1             62              1            109
    Sacramento San Bernardino      San Diego    San Joaquin
             3            187              2              2
   Santa Clara         Shasta Sn Luis Obispo         Tehama
             ?              5              1             10
        Tulare        Ventura           Yolo
             ?              2              1

$age

  .   1  10  11  12  13  14  15  16  17  18  19   2  20  21  22  23  24  25  26
  ?   ?   ?   ?   ?   ?   ?   ?   ?   ?   6   5   1   4   2   3   6   8   3   9
...
 82  83  84  85  86  87  88  89   9  91  93  94
 10   5   6   4   2   2   1   6   1   4   ?   ?

$sex

  .   F   M
  2 294 483

$syndrome

Unknown     WNF    WNND
    105     391     283

$date.onset

02/02/2005 05/14/2004 05/16/2004 05/19/2004 05/22/2004
         1          1          1          1          2
...
10/28/2004 10/29/2004 10/30/2004 11/08/2004 11/12/2004
         2          1          1          4          2

$date.tested

01/21/2005 02/04/2005 02/23/2005 06/02/2004 06/16/2004
         1          1          1          1          4
...
11/29/2004 12/02/2004 12/03/2004 12/07/2004
         8          1          2          1

$death

  .  No Yes
 66 686  27


What did we learn? First, there are 779 observations but the ids run up to 781;
therefore, 2 observations were removed from the original data set. Second, we see
that the variables age, sex, syndrome, and death have missing values that need to be
converted to NAs. This can be done one field at a time, or for the whole data frame
in one step. Here is the R code.


#individually
wd$age[wd$age=="."] <- NA
wd$sex[wd$sex=="."] <- NA
wd$syndrome[wd$syndrome=="Unknown"] <- NA
wd$death[wd$death=="."] <- NA

#or globally
wd[wd=="." | wd=="Unknown"] <- NA


After running the above code, let’s evaluate one variable to verify the missing 
values were converted to NAs. 


> table(wd$death)

 No Yes
686  27
> table(wd$death, exclude=NULL)

  No  Yes <NA>
 686   27   66


We also notice that the entry for one of the counties, San Luis Obispo, was
misspelled (Sn Luis Obispo). We can use replacement to make the correction:


> wd$county[wd$county=="Sn Luis Obispo"] <- "San Luis Obispo"
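
The vectorized clean-up steps above can be rehearsed on a small data frame before touching the real file. The following is only a sketch: the data frame toy and its values are invented for illustration and are not part of the West Nile virus data set.

```r
# Hypothetical toy data standing in for the surveillance file
toy <- data.frame(
  county   = c("Kern", "Sn Luis Obispo", "Kern"),
  age      = c("40", ".", "64"),
  syndrome = c("WNF", "Unknown", "WNND"),
  stringsAsFactors = FALSE
)

# Replace every missing-value code in one vectorized step
toy[toy == "." | toy == "Unknown"] <- NA

# Correct a misspelled county by indexing and replacement
toy$county[toy$county == "Sn Luis Obispo"] <- "San Luis Obispo"

# With "." gone, age can be coerced from character to numeric
toy$age <- as.numeric(toy$age)

# Count the NAs in each field to verify the conversion
colSums(is.na(toy))
```

Because every step is a line of code rather than a manual edit, the cleaning can be rerun and audited whenever the raw file changes.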


3.2.4 Text processing 


On occasion, we will need to process and manipulate character vectors using a
vectorized approach. For example, suppose we need to convert a character vector of
dates from “mm/dd/yy” to “yyyy-mm-dd”.⁸ We’ll start by using the substr function.
This function extracts characters from a character vector based on position.


> bd <- c("07/17/96", "12/09/00", "11/07/97")
> mon <- substr(bd, start=1, stop=2); mon
[1] "07" "12" "11"
> day <- substr(bd, 4, 5); day
[1] "17" "09" "07"
> yr <- as.numeric(substr(bd, 7, 8)); yr
[1] 96  0 97
> yr2 <- ifelse(yr<=19, yr+2000, yr+1900); yr2
[1] 1996 2000 1997
> bdfin <- paste(yr2, "-", mon, "-", day, sep=""); bdfin
[1] "1996-07-17" "2000-12-09" "1997-11-07"


In this example, we needed to convert “00” to “2000”, and “96” and “97” to “1996”
and “1997”, respectively. The trick here was to coerce the character vector into
a numeric vector so that 1900 or 2000 could be added to it. Using the ifelse
function, for values ≤ 19 (arbitrarily chosen), 2000 was added; otherwise 1900 was
added. The paste function was used to paste the components back into a new
vector with the standard date format.
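
The same steps can be wrapped in a small function so the conversion is reusable. This is a minimal sketch: the function name mdy_to_iso and its pivot argument are hypothetical, introduced here only for illustration. For comparison, base R's as.Date with the "%m/%d/%y" format is shown as well; note that it uses a different fixed cutoff for two-digit years (values 00 to 68 are read as 20xx), so the two approaches agree here but would differ for years such as 25.

```r
# Hypothetical helper; 'pivot' is the largest two-digit year treated as 20xx
mdy_to_iso <- function(x, pivot = 19) {
  mon <- substr(x, 1, 2)                      # characters 1-2: month
  day <- substr(x, 4, 5)                      # characters 4-5: day
  yr  <- as.numeric(substr(x, 7, 8))          # characters 7-8: two-digit year
  yr4 <- ifelse(yr <= pivot, yr + 2000, yr + 1900)
  paste(yr4, mon, day, sep = "-")             # reassemble as yyyy-mm-dd
}

bd <- c("07/17/96", "12/09/00", "11/07/97")
iso.manual <- mdy_to_iso(bd)
iso.manual

# Alternative: let R parse the dates, then format them as character
iso.parsed <- format(as.Date(bd, format = "%m/%d/%y"))
iso.parsed
```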

The substr function can also be used to replace characters in a character vector.
Remember, if it can be indexed, it can be replaced.


> bd
[1] "07/17/96" "12/09/00" "11/07/97"
> substr(bd, 3, 3) <- "-"
> substr(bd, 6, 6) <- "-"
> bd
[1] "07-17-96" "12-09-00" "11-07-97"


3.3 Sorting data 


The sort function sorts a vector as expected: 


> x <- sample(1:10, 10); x 
[1] A230 6 1 Pf 9. . 3d: Bo Be EO 

> sort(x)
 [1]  1  2  3  4  5  6  7  8  9 10
> sort(x, decreasing = TRUE) #reverse sort
 [1] 10  9  8  7  6  5  4  3  2  1
> rev(sort(x)) #reverse sort
 [1] 10  9  8  7  6  5  4  3  2  1





⁸ ISO 8601 is an international standard for date and time representations issued by the International
Organization for Standardization (ISO). See http://www.iso.org
Table 3.2 R functions for processing text in character vectors

Function  Description              Try these examples
nchar     Returns the number       x <- c("a", "ab", "abc", "abcd")
          of characters in each    nchar(x)
          element of a character
          vector
substr    Extract or replace       #extraction
          substrings in a          mon <- substr(some.dates, 1, 2); mon
          character vector         day <- substr(some.dates, 4, 5); day
                                   yr <- substr(some.dates, 7, 8); yr
                                   #replacement
                                   mdy <- paste(mon, day, yr); mdy
                                   substr(mdy, 3, 3) <- '/'
                                   substr(mdy, 6, 6) <- '/'
                                   mdy
paste     Concatenate vectors      rd <- paste(mon, "/", day, "/", yr,
          after converting to        sep="")
          character                rd
strsplit  Split the elements of    some.dates <- c("10/02/70", "02/04/67")
          a character vector       some.dates
          into substrings          strsplit(some.dates, "/")





However, if we want to sort one vector based on the ordering of elements from 
another vector, use the order function. The order function generates an index- 
ing/repositioning vector. Study the following example: 





> x <- sample(1:20, 5); x
[1] 18 10  6 13 11
> sort(x) #sorts as expected
[1]  6 10 11 13 18
> y <- sample(1:20, 5); y
[1] 11 10 13  8  9
> order(y) #4th element to 1st position, 5th to 2nd, etc.
[1] 4 5 2 1 3
> x[order(y)] #use order(y) to sort elements of x
[1] 13 11 10 18  6

Based on this we can see that sort(x) is just x[order(x)].

Now let us see how to use the order function for data frames. First, we create a
small data set.

> sex <- rep(c("Male", "Female"), c(4, 4)) 
> ethnicity <- rep(c("White", "African American", "Latino", 
+ "Asian"), 2) 
> age <- sample(1:100, 8) 
> dat <- data.frame(age, sex, ethnicity) 
> dat <- dat[sample(1:8, 8),] #randomly order rows 
> dat 
  age    sex        ethnicity
5  57 Female            White
8  93 Female            Asian
1   7   Male            White
4  65   Male            Asian
6  38 Female African American
3  27   Male           Latino
2  66   Male African American
7  72 Female           Latino


Okay, now we will sort the data frame based on the ordering of one field, and then 
the ordering of two fields: 


> dat[order(dat$age),] #sort based on 1 variable
  age    sex        ethnicity
1   7   Male            White
3  27   Male           Latino
6  38 Female African American
5  57 Female            White
4  65   Male            Asian
2  66   Male African American
7  72 Female           Latino
8  93 Female            Asian
> dat[order(dat$sex, dat$age),] #sort based on 2 variables
  age    sex        ethnicity
6  38 Female African American
5  57 Female            White
7  72 Female           Latino
8  93 Female            Asian
1   7   Male            White
3  27   Male           Latino
4  65   Male            Asian
2  66   Male African American
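
Two common variations on the sorting shown above are reverse ordering and mixing ascending with descending keys. The sketch below retypes the ages and sexes from the example into a fresh data frame (so the row names differ from the shuffled version above); negating a numeric key to reverse its direction is a general idiom, not something specific to this data.

```r
# Ages and sexes retyped from the example; row names are 1 to 8 here
dat <- data.frame(
  age = c(57, 93, 7, 65, 38, 27, 66, 72),
  sex = c("Female", "Female", "Male", "Male",
          "Female", "Male", "Male", "Female"),
  stringsAsFactors = FALSE
)

# sort(x) and x[order(x)] give the same result
same <- identical(sort(dat$age), dat$age[order(dat$age)])

# Oldest first: reverse the ordering with decreasing = TRUE
oldest.first <- dat[order(dat$age, decreasing = TRUE), ]

# Sex ascending, then age descending within sex:
# negate the numeric key to flip its direction
by.sex.age.desc <- dat[order(dat$sex, -dat$age), ]
by.sex.age.desc
```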





3.4 Indexing (subsetting) data 


For this section, please load the well known Oswego foodborne illness dataset: 


> odat <- read.table("http://www.medepi.net/data/oswego.txt",
+ header = TRUE, as.is = TRUE, sep = "")
> str(odat)
'data.frame': 75 obs. of 21 variables:
 $ id                 : int 2 3 4 6 7 8 9 10 14 16 ...
 $ age                : int 52 65 59 63 70 40 15 33 10 32 ...
 $ sex                : chr "F" "M" "F" "F" ...
 $ meal.time          : chr "8:00 PM" "6:30 PM" "6:30 PM" ...
 $ ill                : chr "Y" "Y" "Y" "Y" ...
 $ onset.date         : chr "4/19" "4/19" "4/19" "4/18" ...
 $ onset.time         : chr "12:30 AM" "12:30 AM" ...
 $ baked.ham          : chr "Y" "Y" "Y" "Y" ...
 $ vanilla.ice.cream  : chr "Y" "Y" "Y" "Y" ...
 $ chocolate.ice.cream: chr "N" "Y" "Y" ...
 $ fruit.salad        : chr "N" "N" "N" "N" ...


3.4.1 Indexing 


Now, we will practice indexing rows from this data frame. First, we create a new
data set that contains only cases. To index the rows with cases we need to generate
a logical vector that is TRUE for every value of odat$ill that “is equivalent to”
"Y". For “is equivalent to” we use the == relational operator.





> cases <- odat$ill=="Y"
> cases
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
...
[73] FALSE FALSE FALSE
> odat.ca <- odat[cases, ]
> odat.ca[, 1:8]
   id age sex meal.time ill onset.date onset.time baked.ham
1   2  52   F   8:00 PM   Y       4/19   12:30 AM         Y
2   3  65   M   6:30 PM   Y       4/19   12:30 AM         Y
3   4  59   F   6:30 PM   Y       4/19   12:30 AM         Y
4   6  63   F   7:30 PM   Y       4/18   10:30 PM         Y
43 71  60   M   7:30 PM   Y       4/19    1:00 AM         N
44 72  18   F   7:30 PM   Y       4/19   12:00 AM         Y
45 74  52   M      <NA>   Y       4/19    2:15 AM         Y
46 75  45   F      <NA>   Y       4/18   11:00 PM         Y


It is very important to understand what we just did: we extracted the rows with cases 
by indexing the data frame with a logical vector. 

Now, we combine relational operators with logical operators to extract rows 
based on multiple criteria. Let’s create a data set with female cases, age less than 
the median age, and consumed vanilla ice cream. 


> fem.cases.vic <- odat$ill=="Y" & odat$sex=="F" &
+ odat$vanilla.ice.cream=="Y" & odat$age < median(odat$age)
> odat.fcv <- odat[fem.cases.vic, ]
> odat.fcv[ , c(1:6, 19)]
   id age sex meal.time ill onset.date vanilla.ice.cream
8  10  33   F   7:00 PM   Y       4/18                 Y
10 16  32   F      <NA>   Y       4/19                 Y
13 20  33   F      <NA>   Y       4/18                 Y
14 21  13   F  10:00 PM   Y       4/19                 Y
18 27  15   F  10:00 PM   Y       4/19                 Y
23 36  35   F      <NA>   Y       4/18                 Y
31 48  20   F   7:00 PM   Y       4/19                 Y
37 58  12   F  10:00 PM   Y       4/19                 Y
40 65  17   F  10:00 PM   Y       4/19                 Y
41 66   8   F      <NA>   Y       4/19                 Y
42 70  21   F      <NA>   Y       4/19                 Y
44 72  18   F   7:30 PM   Y       4/19                 Y


In summary, we see that indexing rows of a data frame consists of using
relational operators (<, >, <=, >=, ==, !=) and logical operators (&, |, !)
to generate a logical vector for indexing the appropriate rows.
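
The pattern summarized above can be tried on a small, made-up data frame. This is only a sketch: the data frame toy and its values are hypothetical and are not taken from the Oswego data set.

```r
# Hypothetical line listing with six people
toy <- data.frame(
  id  = 1:6,
  age = c(12, 35, 8, 52, 21, 40),
  sex = c("F", "M", "F", "F", "M", "F"),
  ill = c("Y", "Y", "N", "Y", "N", "Y"),
  stringsAsFactors = FALSE
)

# Relational operators build logical vectors; & combines them element-wise
young.f.cases <- toy$ill == "Y" & toy$sex == "F" & toy$age < median(toy$age)
young.f.cases

# A logical vector in the row position keeps the rows where it is TRUE
toy[young.f.cases, ]

# which() converts the logical vector to row positions; sum() counts TRUEs
which(young.f.cases)
sum(young.f.cases)

# ! negates: all other cases
other.cases <- toy[!young.f.cases & toy$ill == "Y", ]
other.cases
```

Keeping the logical vector as a named object, as the text does with cases and fem.cases.vic, makes it easy to inspect, count, or negate the selection before using it to index.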


3.4.2 Subsetting 


Subsetting a data frame using the subset function is equivalent to using logical 
vectors to index the data frame. In general, we prefer indexing because it is general- 
izable to indexing any R data object. However, the subset function is a convenient 
alternative for data frames. Again, let’s create a data set with female cases,
age < median, who ate vanilla ice cream.














> odat.fcv <- subset(odat, subset = {ill=="Y" & sex=="F" &
+ vanilla.ice.cream=="Y" & age < median(odat$age)},
+ select = c(id:onset.date, vanilla.ice.cream))
> odat.fcv
   id age sex meal.time ill onset.date vanilla.ice.cream
8  10  33   F   7:00 PM   Y       4/18                 Y
10 16  32   F      <NA>   Y       4/19                 Y
13 20  33   F      <NA>   Y       4/18                 Y
14 21  13   F  10:00 PM   Y       4/19                 Y
18 27  15   F  10:00 PM   Y       4/19                 Y
23 36  35   F      <NA>   Y       4/18                 Y
31 48  20   F   7:00 PM   Y       4/19                 Y
37 58  12   F  10:00 PM   Y       4/19                 Y
40 65  17   F  10:00 PM   Y       4/19                 Y
41 66   8   F      <NA>   Y       4/19                 Y
42 70  21   F      <NA>   Y       4/19                 Y
44 72  18   F   7:30 PM   Y       4/19                 Y

In the subset function, the first argument is the data frame object name, the
second argument (also called subset) evaluates to a logical vector, and the third
argument (called select) specifies the fields to keep. In the second argument,


subset = {...}


the curly brackets are included for convenience to group the logical and relational
operations. In the select argument, using the : operator, we can specify a range
of fields to keep.
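
One practical difference between the two approaches appears when the fields contain missing values: indexing with a logical vector that contains NA returns a row of NAs for each NA, whereas subset treats NA as FALSE. The sketch below uses a hypothetical data frame (toy) to show this; wrapping the condition in which() is one way to make indexing behave like subset.

```r
# Hypothetical data with missing values in age and ill
toy <- data.frame(
  id  = 1:5,
  age = c(12, NA, 8, 52, 21),
  ill = c("Y", "Y", "N", "Y", NA),
  stringsAsFactors = FALSE
)

# Logical indexing: each NA in the condition yields a row filled with NA
by.index <- toy[toy$ill == "Y" & toy$age < 30, c("id", "age")]

# subset(): NA in the condition is treated as FALSE
by.subset <- subset(toy, subset = ill == "Y" & age < 30, select = c(id, age))

# which() drops the NAs, so indexing then matches subset()
by.which <- toy[which(toy$ill == "Y" & toy$age < 30), c("id", "age")]

nrow(by.index)
nrow(by.subset)
nrow(by.which)
```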


3.5 Transforming data 


Transforming fields in a data frame is very common. The most common transfor- 
mations include the following: 


• Numerical transformation of a numeric vector
• Discretizing a numeric vector into categories or levels (“categorical variable”)
• Re-coding integers that represent levels of a categorical variable


For each of these, we must decide whether the newly created vector should be a new 
field in the data frame, overwrite the original field in the data frame, or not be a field 
in the data frame (but rather a vector object in the workspace). For the examples that 
follow load the well known Oswego foodborne illness dataset: 


> odat <- read.table("http://www.medepi.net/data/oswego.txt", 
+ header = TRUE, as.is = TRUE, sep = "") 





3.5.1 Numerical transformation 


> # transform age variable centering it
> # create new field in same data frame
> odat$age
 [1] 52 65 59 63 70 40 15 33 10 32 ...
...
[73] 17 36 14
> odat$age.centered <- odat$age - mean(odat$age)
> odat$age.centered
 [1]  15.1866667  28.1866667  22.1866667  26.1866667
...
[73] -19.8133333  -0.8133333 -22.8133333
> # overwrite original field in same data frame (not recommended!!!)
> odat$age <- odat$age - mean(odat$age)
> # create new vector in workspace; data frame remains unchanged
> age.centered <- odat$age - mean(odat$age)
> age.centered
 [1]  15.1866667  28.1866667  22.1866667  26.1866667
...
[73] -19.8133333  -0.8133333 -22.8133333


For convenience, the transform function facilitates the transformation of
numeric vectors in a data frame. The transform function comes in handy when we
plan on transforming many fields: we do not need to specify the data frame each
time we refer to a field name. For example, the following lines are equivalent.
Both add a new transformed field to the data frame.


odat$age.centered <- odat$age - mean(odat$age)
odat <- transform(odat, age.centered = age - mean(age))
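
The convenience of transform is clearer when several fields are created in one call. The following sketch uses a hypothetical data frame (toy, with invented age and wt.kg fields); the conversion factor 2.2 is an approximation used only for illustration.

```r
# Hypothetical data frame with two numeric fields
toy <- data.frame(id = 1:4,
                  age = c(10, 20, 30, 40),
                  wt.kg = c(30, 60, 70, 80))

# Several new fields in one call; field names are used without the toy$ prefix
toy2 <- transform(toy,
                  age.centered = age - mean(age),
                  age.decades  = age / 10,
                  wt.lb        = wt.kg * 2.2)   # approximate kg-to-lb factor
toy2

# The original data frame is unchanged until we assign the result back
names(toy)
```

One caveat: all the expressions are evaluated against the original data frame, so a field created in a transform call cannot be used by another expression in that same call.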





3.5.2 Creating categorical variables (factors) 


Now, reload the Oswego data set to recover the original odat$age field. We are
going to create a new field with the following seven age categories (in years): < 1, 1
to 4, 5 to 14, 15 to 24, 25 to 44, 45 to 64, and 65+. We will demonstrate this using
several methods:


3.5.2.1 Using cut function (preferred method) 


> agecat <- cut(odat$age, breaks = c(0, 1, 5, 15, 25, 45,
+ 65, 100))
> agecat
 [1] (45,65]  (45,65]  (45,65]  (45,65]  (65,100] (25,45]
...
[73] (15,25]  (25,45]  (5,15]
Levels: (0,1] (1,5] (5,15] (15,25] (25,45] ... (65,100]


Note that the cut function generated a factor with 7 levels, one for each interval.
The notation (15, 25] means that the interval is open on the left boundary (> 15) and
closed on the right boundary (≤ 25). However, for age categories, it makes more
sense to have age boundaries closed on the left and open on the right: [a, b). To
change this we set the option right = FALSE.





> agecat <- cut(odat$age, breaks = c(0, 1, 5, 15, 25, 45,
+ 65, 100), right = FALSE)
> agecat
 [1] [45,65)  [65,100) [45,65)  [45,65)  [65,100) [25,45)
...
[73] [15,25)  [25,45)  [5,15)
Levels: [0,1) [1,5) [5,15) [15,25) [25,45) ... [65,100)
> table(agecat)
agecat
   [0,1)    [1,5)   [5,15)  [15,25)  [25,45)  [45,65) [65,100)
       0        1       14       13       18       20        9


Okay, this looks good, but we can add labels since our readers may not be familiar 
with open and closed interval notation [a, b). 


> agelabs <- c("<1", "1-4", "5-14", "15-24", "25-44", "45-64",
+ "65+")
> agecat <- cut(odat$age, breaks = c(0, 1, 5, 15, 25, 45,
+ 65, 100), right = FALSE, labels = agelabs)
> agecat
 [1] 45-64 65+   45-64 45-64 65+   25-44 15-24
...
[71] 5-14  5-14  15-24 25-44 5-14
Levels: <1 1-4 5-14 15-24 25-44 45-64 65+
> table(agecat, case = odat$ill)
       case
agecat   N  Y
  <1     0  0
  1-4    0  1
  5-14   8  6
  15-24  5  8
  25-44  8 10
  45-64  5 15
  65+    3  6


3.5.2.2 Using indexing and assignment (replacement) 


The cut function is the preferred method to create a categorical variable. However, 
suppose one does not know about the cut function. Applying basic R concepts 
always works! 


> agegroup <- odat$age
> agegroup[odat$age<1] <- 1
> agegroup[odat$age>=1 & odat$age<5] <- 2
> agegroup[odat$age>=5 & odat$age<15] <- 3
> agegroup[odat$age>=15 & odat$age<25] <- 4
> agegroup[odat$age>=25 & odat$age<45] <- 5
> agegroup[odat$age>=45 & odat$age<65] <- 6
> agegroup[odat$age>=65] <- 7
> #create factor
> agelabs <- c("<1", "1 to 4", "5 to 14", "15 to 24",
+ "25 to 44", "45 to 64", "65+")
> agegroup <- factor(agegroup, levels = 1:7, labels = agelabs)
> agegroup
 [1] 45 to 64 65+      45 to 64 45 to 64 65+      25 to 44
...
[73] 15 to 24 25 to 44 5 to 14
7 Levels: <1 1 to 4 5 to 14 15 to 24 25 to 44 ... 65+
> table(case = odat$ill, agegroup)
    agegroup
case <1 1 to 4 5 to 14 15 to 24 25 to 44 45 to 64 65+
   N  0      0       8        5        8        5   3
   Y  0      1       6        8       10       15   6

In these previous examples, notice that agegroup is a factor object that is not 
a field in the odat data frame. 
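
Where an age falls when it sits exactly on a break point is the detail most likely to cause errors with cut. The following sketch uses a short, hypothetical vector of boundary ages to compare the default right-closed intervals with the left-closed intervals obtained with right = FALSE.

```r
# Hypothetical ages chosen to sit on or next to the break points
age    <- c(0, 1, 4, 5, 14, 15, 65)
breaks <- c(0, 1, 5, 15, 25, 45, 65, 100)

# Default: intervals are (a, b], so an age of exactly 0 falls in no interval
right.closed <- cut(age, breaks = breaks)

# right = FALSE: intervals are [a, b), the usual convention for age groups
left.closed <- cut(age, breaks = breaks, right = FALSE)

# Side-by-side comparison
data.frame(age, right.closed, left.closed)
```

With the default setting an age of 0 is returned as NA and an age of 5 is grouped with ages 1 to 4; with right = FALSE both land in the expected age group.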


3.5.3 “Re-coding” levels of a categorical variable


In the previous example the categorical variable was a numeric vector (1, 2, 3, 4, 5,
6, 7) that was converted to a factor and provided labels (“<1”, “1 to 4”, “5 to 14”,
...). In fact, categorical variables are often represented by integers (for example, 0
= no, 1 = yes; or 0 = non-case, 1 = case) and provided labels. Often, ASCII text data
files contain integer codes that require a data dictionary to convert these integers into
categorical variables in a statistical package. In R, keeping track of integer codes
for categorical variables is unnecessary. Therefore, re-coding the underlying integer
codes is also unnecessary; however, if we feel the need to do so, here’s how.


> # Create categorical variable
> ethlabs <- c("White", "Black", "Latino", "Asian")
> ethnicity <- sample(ethlabs, 100, replace = T)
> ethnicity <- factor(ethnicity, levels = ethlabs)
> ethnicity
  [1] Black  Asian  Latino White  Black  Asian  White  Black
...
 [97] Black  Black  Asian  Latino
Levels: White Black Latino Asian








The levels option allowed us to determine the display order, and the first level
becomes the reference level in statistical models. To display the underlying numeric
code use the unclass function, which preserves the levels attribute.⁹


> x <- unclass(ethnicity)
> x
  [1] 2 4 3 1 2 4 1 2 ...
...
attr(,"levels")
[1] "White"  "Black"  "Latino" "Asian"

⁹ The as.integer function also works but does not preserve the levels attribute.


To recover the original factor, 





> factor(x, labels=levels(x))
[1] Black Asian Latino White Black Asian White 


[92] Latino Asian Asian White White Black Black 
[99] Asian Latino 
Levels: White Black Latino Asian 


Although one can extract the integer code, why would one need to do so? One
is tempted to use the integer codes as a way to share data sets. However, we
recommend not using the integer codes, but rather just providing the data in its native
format.¹⁰ This way, the raw data is more interpretable and we eliminate the
intermediate step of needing to label the integer code. Also, if the data dictionary is
lost or not provided, the raw data is still interpretable.

In R, we can re-label the levels using the levels function and assigning to it a
character vector of new labels. Make sure the order of the new labels corresponds to
the order of the factor levels.


> ethnicity2 <- ethnicity
> levels(ethnicity2) <- c("Caucasian", "African American",
+ "Hispanic", "Asian")
> table(ethnicity2)
ethnicity2
       Caucasian African American         Hispanic            Asian
              23               31               28               18


In R, we can re-order and re-label at the same time using the levels function 
and assigning to it a list. 


> table(ethnicity)
ethnicity
 White  Black Latino  Asian
    23     31     28     18
> ethnicity3 <- ethnicity
> levels(ethnicity3) <- list(Hispanic = "Latino", Asian = "Asian",
+ Caucasian = "White", "African American" = "Black")
> table(ethnicity3)
ethnicity3
        Hispanic            Asian        Caucasian African American
              28               18               23               31


The list function is necessary to assure the re-ordering. To re-order without
re-labeling just do the following:


¹⁰ For example, http://www.medepi.net/data/oswego.txt
> table(ethnicity)
ethnicity
 White  Black Latino  Asian
    23     31     28     18
> ethnicity4 <- ethnicity
> levels(ethnicity4) <- list(Latino = "Latino", Asian = "Asian",
+ White = "White", Black = "Black")
> table(ethnicity4)
ethnicity4
Latino  Asian  White  Black
    28     18     23     31
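
The three uses of the levels function (re-label, re-order and re-label, and combine) can be compared side by side on a small fixed factor. This is a sketch with invented values (the factor eth below is hypothetical), so the counts are easy to check by eye.

```r
# Hypothetical factor: 2 White, 2 Black, 1 Latino, 1 Asian
eth <- factor(c("White", "Black", "Latino", "Asian", "Black", "White"),
              levels = c("White", "Black", "Latino", "Asian"))

# Re-label only: a character vector in the same order as the current levels
eth.relabel <- eth
levels(eth.relabel) <- c("Caucasian", "African American", "Hispanic", "Asian")

# Re-order and re-label: a list of the form new.label = "old.label"
eth.both <- eth
levels(eth.both) <- list(Hispanic = "Latino", Asian = "Asian",
                         Caucasian = "White", "African American" = "Black")

# Combine levels: several old labels mapped to one new label
eth.two <- eth
levels(eth.two) <- list(White = "White",
                        Other = c("Black", "Latino", "Asian"))

table(eth.relabel)
table(eth.both)
table(eth.two)
```

In each case only the levels attribute changes; the individual observations keep their category, which is why the counts follow the labels.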


In R, we can sort the factor levels by using the factor function in one of two 
ways: 


> table(ethnicity)
ethnicity
 White  Black Latino  Asian
    23     31     28     18
> ethnicity5a <- factor(ethnicity, sort(levels(ethnicity)))
> table(ethnicity5a)
ethnicity5a
 Asian  Black Latino  White
    18     31     28     23
> ethnicity5b <- factor(as.character(ethnicity))
> table(ethnicity5b)
ethnicity5b
 Asian  Black Latino  White
    18     31     28     23


In the first example, we assigned to the levels argument the sorted level names. 
In the second example, we started from scratch by coercing the original factor into 
a character vector which is then ordered alphabetically by default. 


3.5.3.1 Setting factor reference level 


The first level of a factor is the reference level for some statistical models (e.g., 
logistic regression). To set a different reference level use the relevel function. 











> levels(ethnicity)
[1] "White"  "Black"  "Latino" "Asian"
> ethnicity6 <- relevel(ethnicity, ref = "Asian")
> levels(ethnicity6)
[1] "Asian"  "White"  "Black"  "Latino"
Table 3.3 Categorical variable represented as a factor or a set of dummy variables

Factor      Dummy variables
Ethnicity   Asian  Black  Latino
White       0      0      0
Asian       1      0      0
Black       0      1      0
Latino      0      0      1


As we can see, there is tremendous flexibility in dealing with factors without 
the need to “re-code” categorical variables. This approach facilitates reviewing our 
work and minimizes errors. 


3.5.4 Use factors instead of dummy variables 


A nonordered factor (nominal categorical variable) with k levels can also be
represented with k − 1 dummy variables. For example, the ethnicity factor has
four levels: white, Asian, black, and Latino. Ethnicity can also be represented using
3 dichotomous variables, each coded 0 or 1. For example, using white as the
reference group, the dummy variables would be asian, black, and latino (see
Table 3.3). The values of those three dummy variables (0 or 1) are sufficient to
represent one of four possible ethnic categories. Dummy variables can be used in
statistical models. However, in R, it is unnecessary to create dummy variables; just
create a factor with the desired number of levels and set the reference level.
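
To see the dummy variables that R builds from a factor behind the scenes, we can inspect the design matrix with model.matrix. The sketch below assumes R's default treatment contrasts for nonordered factors; the factor eth is a hypothetical four-observation example arranged to reproduce the layout of Table 3.3.

```r
# One observation per level, with White as the first (reference) level
eth <- factor(c("White", "Asian", "Black", "Latino"),
              levels = c("White", "Asian", "Black", "Latino"))

# Design matrix: an intercept column plus k - 1 = 3 dummy variables
dummies <- model.matrix(~ eth)
dummies

# Changing the reference level changes which dummy variables are created
eth2 <- relevel(eth, ref = "Asian")
dummies2 <- model.matrix(~ eth2)
colnames(dummies2)
```

Modeling functions such as glm construct this matrix automatically from the factor, which is why creating the dummy variables by hand is unnecessary.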


3.5.5 Conditionally transforming the elements of a vector 


We can conditionally transform the elements of a vector using the ifelse func- 
tion. This function works as follows: ifelse(test, if test = TRUE do 
this, else do this). 


> x <- sample(c("M", "F"), 10, replace = T); x
 [1] "M" "F" "M" "F" "M" "F" "M" "F" "M" "F"
> y <- ifelse(x=="M", "Male", "Female")
> y
 [1] "Male"   "Female" "Male"   "Female" "Male"   "Female"
 [7] "Male"   "Female" "Male"   "Female"
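
A two-way ifelse assigns the "else" value to everything that fails the test, including unexpected codes. The following sketch uses a hypothetical vector (sex.code) containing a lowercase entry and a missing value to show the pitfall, and a nested ifelse as one way to handle more than two outcomes.

```r
# Hypothetical codes: note the lowercase "m" and the missing value
sex.code <- c("M", "F", NA, "m", "F")

# Simple two-way recode: "m" fails the test and is labeled "Female";
# the NA stays NA because the test itself is NA
sex.simple <- ifelse(sex.code == "M", "Male", "Female")
sex.simple

# Nested ifelse: test each expected code explicitly (after converting
# to uppercase) and assign NA to anything else
code <- toupper(sex.code)
sex.clean <- ifelse(code == "M", "Male",
                    ifelse(code == "F", "Female", NA))
sex.clean
```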
Table 3.4 R functions for transforming variables in data frames

Function   Description             Try these examples
<-         Transforming a          dat <- data.frame(id=1:3, x=c(0.5,1,2));
           vector and assigning    dat
           it to a new data        dat$logx <- log(dat$x) #creates new field
           frame variable name     dat
transform  Transform one or        dat <- data.frame(id=1:3, x=c(0.5,1,2));
           more variables from     dat
           a data frame            dat <- transform(dat, logx = log(x))
                                   dat
cut        Creates a factor by     age <- sample(1:100, 500, replace = TRUE)
           dividing the range of   #cut into 2 intervals
           a numeric vector into   agecut <- cut(age, 2, right = FALSE)
           intervals               table(agecut)
                                   #cut using specified intervals
                                   agecut2 <- cut(age, c(0, 50, 100),
                                     right = FALSE, include.lowest = TRUE)
                                   table(agecut2)
levels     Gives access to the     sex <- sample(c("M","F","T"), 500,
           levels attribute of a     replace=T)
           factor                  sex <- factor(sex)
                                   table(sex)
                                   #relabel each level; use same order
                                   levels(sex) <- c("Female", "Male",
                                     "Transgender")
                                   table(sex)
                                   #relabel/recombine
                                   levels(sex) <- c("Female", "Male",
                                     "Male")
                                   table(sex)
                                   #reorder and/or relabel
                                   levels(sex) <- list("Men" = "Male",
                                     "Women" = "Female")
                                   table(sex)
relevel    Set the reference       sex2 <- relevel(sex, ref = "Women")
           level for a factor      table(sex2)
ifelse     Conditionally           age <- sample(1:100, 1000, replace =
           operate on elements       TRUE)
           of a vector based on    agecat <- ifelse(age<=50, "<=50", ">50")
           a test                  table(agecat)





3.6 Merging data 


In general, R’s strength is not data management but rather data analysis. Because
R can access and operate on multiple objects in the workspace, it is generally not
necessary to merge data objects into one data object in order to conduct analyses.
On occasion, however, it may be necessary to merge two data frames into one data
frame.

Data frames that contain data on individual subjects are generally of two types:
(1) each row contains data collected on one and only one individual, or (2) multiple
rows contain repeated measurements on individuals. The latter approach is more
efficient at storing data. For example, here are two approaches to collecting multiple
telephone numbers for two individuals.





> tab1
            name   wphone   fphone   mphone
1   Tomas Aragon 643-4935 643-2926 847-9139
2 Wayne Enanoria 643-4934     <NA>     <NA>
> tab2
            name telphone teletype
1   Tomas Aragon 643-4935     Work
2   Tomas Aragon 643-2926      Fax
3   Tomas Aragon 847-9139   Mobile
4 Wayne Enanoria 643-4934     Work

The first approach is represented by tab1, and the second approach by tab2.¹¹
Data is more efficiently stored in tab2, and adding new types of telephone numbers
only requires assigning a new value (e.g., Pager) to the teletype field.





> tab2
            name telphone teletype
1   Tomas Aragon 643-4935     Work
2   Tomas Aragon 643-2926      Fax
3   Tomas Aragon 847-9139   Mobile
4 Wayne Enanoria 643-4934     Work
5   Tomas Aragon 719-1234    Pager


In both these data frames, an indexing field identifies the unique individual that is
associated with each row. In this case, the name column is the indexing field for
both data frames.

Now, let’s look at an example of two related data frames that are linked by an 
indexing field. The first data frame contains telephone numbers for 5 employees 
and fname is the indexing field. The second data frame contains email addresses 
for 3 employees and name is the indexing field. 


> phone
   fname phonenum phonetype
1  Tomas 643-4935      work
2  Tomas 847-9139    mobile
3  Tomas 643-4926       fax
4  Chris 643-3932      work
5  Chris 643-4926       fax
6  Wayne 643-4934      work
7  Wayne 643-4926       fax
8    Ray 643-4933      work
9    Ray 643-4926       fax
10 Diana 643-3931      work
> email
   name                  mail mailtype
1 Tomas   aragon@berkeley.edu     Work
2 Tomas     aragon@medepi.net Personal
3 Wayne enanoria@berkeley.edu     Work
4 Wayne  enanoria@idready.org     Work
5 Chris  csiador@berkeley.edu     Work
6 Chris     csiador@yahoo.com Personal

¹¹ This approach is the basis for designing and implementing relational databases. A relational
database consists of multiple tables linked by an indexing field.


To merge these two data frames use the merge function. 














> dat <- merge(email, phone, by.x="name", by.y="fname")
> dat
    name                  mail mailtype phonenum phonetype
1  Chris  csiador@berkeley.edu     Work 643-3932      work
2  Chris     csiador@yahoo.com Personal 643-3932      work
3  Chris  csiador@berkeley.edu     Work 643-4926       fax
4  Chris     csiador@yahoo.com Personal 643-4926       fax
5  Tomas   aragon@berkeley.edu     Work 643-4935      work
6  Tomas     aragon@medepi.net Personal 643-4935      work
7  Tomas   aragon@berkeley.edu     Work 847-9139    mobile
8  Tomas     aragon@medepi.net Personal 847-9139    mobile
9  Tomas   aragon@berkeley.edu     Work 643-4926       fax
10 Tomas     aragon@medepi.net Personal 643-4926       fax
11 Wayne enanoria@berkeley.edu     Work 643-4934      work
12 Wayne  enanoria@idready.org     Work 643-4934      work
13 Wayne enanoria@berkeley.edu     Work 643-4926       fax
14 Wayne  enanoria@idready.org     Work 643-4926       fax
> dat <- merge(phone, email, by.x="fname", by.y="name")
> dat
   fname phonenum phonetype                  mail mailtype
1  Chris 643-3932      work  csiador@berkeley.edu     Work
2  Chris 643-4926       fax  csiador@berkeley.edu     Work
3  Chris 643-3932      work     csiador@yahoo.com Personal
4  Chris 643-4926       fax     csiador@yahoo.com Personal
5  Tomas 643-4935      work   aragon@berkeley.edu     Work
6  Tomas 847-9139    mobile   aragon@berkeley.edu     Work
7  Tomas 643-4926       fax   aragon@berkeley.edu     Work
8  Tomas 643-4935      work     aragon@medepi.net Personal
9  Tomas 847-9139    mobile     aragon@medepi.net Personal
10 Tomas 643-4926       fax     aragon@medepi.net Personal
11 Wayne 643-4934      work enanoria@berkeley.edu     Work
12 Wayne 643-4926       fax enanoria@berkeley.edu     Work
13 Wayne 643-4934      work  enanoria@idready.org     Work
14 Wayne 643-4926       fax  enanoria@idready.org     Work


The by.x and by.y options identify the indexing fields. By default, R selects the
rows from the two data frames based on the intersection of the indexing fields
(by.x, by.y). To merge the union of the indexing fields, set all=TRUE:

> dat <- merge(phone, email, by.x="fname", by.y="name",
+ all=TRUE)
> dat
   fname phonenum phonetype                  mail mailtype
1  Chris 643-3932      work  csiador@berkeley.edu     Work
2  Chris 643-4926       fax  csiador@berkeley.edu     Work
3  Chris 643-3932      work     csiador@yahoo.com Personal
4  Chris 643-4926       fax     csiador@yahoo.com Personal
5  Diana 643-3931      work                  <NA>     <NA>
6    Ray 643-4933      work                  <NA>     <NA>
7    Ray 643-4926       fax                  <NA>     <NA>
8  Tomas 643-4935      work   aragon@berkeley.edu     Work
9  Tomas 847-9139    mobile   aragon@berkeley.edu     Work
10 Tomas 643-4926       fax   aragon@berkeley.edu     Work
11 Tomas 643-4935      work     aragon@medepi.net Personal
12 Tomas 847-9139    mobile     aragon@medepi.net Personal
13 Tomas 643-4926       fax     aragon@medepi.net Personal
14 Wayne 643-4934      work enanoria@berkeley.edu     Work
15 Wayne 643-4926       fax enanoria@berkeley.edu     Work
16 Wayne 643-4934      work  enanoria@idready.org     Work
17 Wayne 643-4926       fax  enanoria@idready.org     Work
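
Between the intersection (the default) and the union (all=TRUE) there is a third common case: keep every row of the first data frame whether or not it has a match. The merge function supports this with the all.x option (and all.y for the second data frame). The sketch below uses small hypothetical data frames; the names and addresses are invented for illustration.

```r
# Hypothetical data: Ray has no email, Pat has no phone
phone <- data.frame(fname    = c("Tomas", "Tomas", "Chris", "Ray"),
                    phonenum = c("643-4935", "847-9139", "643-3932", "643-4933"),
                    stringsAsFactors = FALSE)
email <- data.frame(name = c("Tomas", "Chris", "Chris", "Pat"),
                    mail = c("t@example.org", "c1@example.org",
                             "c2@example.org", "p@example.org"),
                    stringsAsFactors = FALSE)

# Intersection: only names present in both data frames
inner <- merge(phone, email, by.x = "fname", by.y = "name")

# Keep all rows of phone; unmatched rows get NA for the email fields
left <- merge(phone, email, by.x = "fname", by.y = "name", all.x = TRUE)

# Union: keep unmatched rows from both data frames
full <- merge(phone, email, by.x = "fname", by.y = "name", all = TRUE)

c(inner = nrow(inner), left = nrow(left), full = nrow(full))
```

Note that each phone row is paired with every matching email row, so a person with 2 phone numbers and 2 email addresses contributes 4 rows to the merged data frame, just as in the example above.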





To “reshape” tabular data look up and study the reshape and stack functions.
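
As a starting point, here is a minimal sketch of how the stack function could convert the one-row-per-person layout of tab1 into the one-row-per-telephone layout of tab2. The data frame is retyped from the telephone example earlier in this section; the intermediate object names are hypothetical, and this is only one of several ways to do the conversion.

```r
# tab1 retyped from the telephone example (one row per person)
tab1 <- data.frame(name   = c("Tomas Aragon", "Wayne Enanoria"),
                   wphone = c("643-4935", "643-4934"),
                   fphone = c("643-2926", NA),
                   mphone = c("847-9139", NA),
                   stringsAsFactors = FALSE)

# stack() places the columns end to end: 'values' holds the numbers,
# 'ind' records which column each value came from
long <- stack(tab1[, c("wphone", "fphone", "mphone")])

# Columns are stacked in order, so the names repeat once per column
long$name <- rep(tab1$name, times = 3)
names(long)[1:2] <- c("telphone", "teletype")

# Drop the missing numbers and put name first, as in tab2
tab2 <- long[!is.na(long$telphone), c("name", "telphone", "teletype")]
tab2
```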


3.7 Executing commands from, and directing output to, a file 


3.7.1 The source function 


We use the source function to execute R commands that are contained in an ASCII
text file. For example, consider the contents of this source file (chap03.R):


i <- 1:5 
x <- outer(i, i, "*") 
show (x) 


Here we run source from the R command prompt: 
> source("/home/tja/Documents/wip/epir/r/chap03.R")
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    4    6    8   10
[3,]    3    6    9   12   15
[4,]    4    8   12   16   20
[5,]    5   10   15   20   25


Nothing is printed to the console unless we explicitly use the show (or print)
function. This enables us to view only the results we want to review.

An alternative approach is to print everything to the console as if the R commands
were being entered directly at the command prompt. For this we do not need to use the
show function in the source file; however, we must set the echo option to TRUE
in the source function. Here is the edited source file (chap03.R):


i <- 1:5
x <- outer(i, i, "*")
x


Here we run source from the R command prompt: 


> source("/home/tja/Documents/wip/epir/r/chap03.R", echo=TRUE)

> i <- 1:5

> x <- outer(i, i, "*")

> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    4    6    8   10
[3,]    3    6    9   12   15
[4,]    4    8   12   16   20
[5,]    5   10   15   20   25


3.7.2 The sink and capture.output functions


We can add code to the source file to “sink” selected results to an output file using
the sink or capture.output functions. Consider our edited source file:

i <- 1:5
x <- outer(i, i, "*")
sink("/home/tja/Documents/wip/epir/r/chap03.log")
cat("Here are the results of the outer function", fill=TRUE)
show(x)
sink()


Here we run source from the R command prompt:


> source("/home/tja/Documents/wip/epir/r/chap03.R")
>


Nothing was printed to the console because sink sent it to the output file (chap03.log). 
Here are the contents of chap03.log: 


Here are the results of the outer function 


     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    4    6    8   10
[3,]    3    6    9   12   15
[4,]    4    8   12   16   20
[5,]    5   10   15   20   25


The first sink opened a connection and created the output file (chap03.log). The
cat and show functions printed results to the output file. The second sink closed the
connection.

Alternatively, as before, if we use the echo = TRUE option in the source 
function, everything is either printed to the console or output file. The sink con- 
nection determines what is printed to the output file. Here is the edited source file 
(chap03.R): 





i <- 1:5
x <- outer(i, i, "*")
sink("/home/tja/Documents/wip/epir/r/chap03.log")
# Here are the results of the outer function
x
sink()


Here we run source from the R command prompt:


> source("/home/tja/Documents/wip/epir/r/chap03.R", echo=T)

> i <- 1:5

> x <- outer(i, i, "*")

> sink("/home/tja/Documents/wip/epir/r/chap03.log")
>


Nothing was printed to the console after the first sink because it was printed to
the output file (chap03.log). Here are the contents of chap03.log:


> # Here are the results of the outer function 
> x 


     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    4    6    8   10
[3,]    3    6    9   12   15
[4,]    4    8   12   16   20
[5,]    5   10   15   20   25
> sink()


The sink and capture.output functions accomplish the same task: sending
results to an output file. The sink function works in pairs, opening and closing
the connection to the output file. In contrast, the capture.output function appears
once and only prints the last object to the output file. Here is the edited source file
(chap03.R) using capture.output instead of sink:


i <- 1:5
x <- outer(i, i, "*")
capture.output(
{
# Here are the results of the outer function
x
}, file = "/home/tja/Documents/wip/epir/r/chap03.log")


Here we run source from the R command prompt: 


> source("/home/tja/Documents/wip/epir/r/chap03.R", echo=TRUE)

> i <- 1:5

> x <- outer(i, i, "*")

> capture.output(
+ {
+ # Here are the results of the outer function
+ x
+ }, file = "/home/tja/Documents/wip/epir/r/chap03.log")





And here are the contents of chap03.log:


     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    4    6    8   10
[3,]    3    6    9   12   15
[4,]    4    8   12   16   20
[5,]    5   10   15   20   25


Even though the capture.output function can run several R commands (be- 
tween curly brackets), only the final object (x) was printed to the output file. This 
would be true even if the echo option was not set to TRUE in the source function. 

In summary, the source function runs R commands contained in an external 
source file. Setting the echo option to TRUE prints commands and results to the 
console as if the commands were directly entered at the command prompt. The 
sink and capture.output functions print results to an output file.
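
The following is a minimal sketch of the same workflow that can be run anywhere, because it uses temporary files rather than the paths shown above. The script contents, the environment env, and the object names printed and logged are illustrative assumptions, and print is used in place of show.

```r
# Write a small script to a temporary file (a stand-in for chap03.R)
script <- tempfile(fileext = ".R")
writeLines(c('i <- 1:5',
             'x <- outer(i, i, "*")',
             'print(x)'), script)

# Evaluate the script in its own environment so that x is easy to find
env <- new.env()

# capture.output() returns, as a character vector, what would be printed
printed <- capture.output(source(script, local = env))

# sink() diverts console output to a file until sink() is called again
logfile <- tempfile(fileext = ".log")
sink(logfile)
source(script, local = env)
sink()
logged <- readLines(logfile)

# Both approaches should record the same printed matrix
identical(printed, logged)
```

Using capture.output without the file option keeps the results in memory, which is convenient when we want to inspect or reuse the printed lines rather than store them.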


3.8 Working with missing and “not available” values 


In R, missing values are represented by NA, but not all NAs represent missing values;
some are just “not available.” NAs can appear in any data object. The NA can
represent a true missing value, or it can result from an operation for which a value is
“not available.” Here are three vectors that contain true missing values.


x <- c(2, 4, NA, 5); x
y <- c("M", NA, "M", "F"); y
z <- c(F, NA, F, T); z


However, elementary numerical operations on objects that contain NA return a single
NA (“not available”). In this instance, R is saying “An answer is ‘not available’
until you tell R what to do with the NAs in the data object.” To remove NAs for a
calculation, specify the na.rm (“NA remove”) option.


> sum(x) # answer not available
[1] NA
> mean(x) # answer not available
[1] NA
> sum(x, na.rm = TRUE) # better
[1] 11
> mean(x, na.rm = TRUE) # better
[1] 3.666667





Here are more examples where NA means an answer is not available: 


> # Inappropriate coercion 

> as.numeric(c("4", "six")) 

[1] 4 NA 

Warning message: 

NAs introduced by coercion 

> # Indexing out of range 

> c(1:5) [7] 

[1] NA 

> df <- data.frame(v1 = 1:3, v2 = 4:6)
> df[4, ] # There is no 4th row
   v1 v2
NA NA NA
> # Indexing with non-existing name
> df["4th", ]
   v1 v2
NA NA NA

> # Operations with NAs 

> NA + 8 

[1] NA 

> # Variance of a single number 

> var (55) 

[1] NA 


In general, these NAs indicate a problem that we must, or should, address.


3.8.1 Testing, indexing, replacing, and recoding


Regardless of the source of the NAs (missing or not available values), using
the is.na function, we can generate a logical vector to identify which positions
contain or do not contain NAs. This logical vector can be used to index the original or
another vector. Values that can be indexed can be replaced.


> x <- c(10, NA, 33, NA, 57)
> y <- c(NA, 24, NA, 47, NA)
> is.na(x) #generate logical vector
[1] FALSE  TRUE FALSE  TRUE FALSE
> which(is.na(x)) #which positions are NA
[1] 2 4
> x[!is.na(x)] #index original vector
[1] 10 33 57
> y[is.na(x)] #index other vector
[1] 24 47
> x[is.na(x)] <- 999 #replacement
> x
[1]  10 999  33 999  57


For a vector, recoding missing values to NA is accomplished using replacement. 


> x <- c(1, -99, 3, -88, 5)
> x[x==-99 | x==-88] <- NA
> x
[1]  1 NA  3 NA  5


For a matrix, we can recode missing values to NA by using replacement one column
at a time, or globally like a vector.





> m <- m2 <- matrix(c(1, -99, 3, 4, -88, 5), 2, 3)
> m
     [,1] [,2] [,3]
[1,]    1    3  -88
[2,]  -99    4    5
> # Replacement one column at a time
> m[m[,1]==-99, 1] <- NA
> m
     [,1] [,2] [,3]
[1,]    1    3  -88
[2,]   NA    4    5
> m[m[,3]==-88, 3] <- NA
> m
     [,1] [,2] [,3]
[1,]    1    3   NA
[2,]   NA    4    5
> # Global replacement
> m2[m2==-99 | m2==-88] <- NA
> m2
     [,1] [,2] [,3]
[1,]    1    3   NA
[2,]   NA    4    5


Likewise, for a data frame, we can recode missing values to NA by using replacement
one field at a time, or globally like a vector.


> fname <- c("Tom", "Unknown", "Jerry")
> age <- c(56, 34, -999)
> z1 <- z2 <- data.frame(fname, age)
> z1
    fname  age
1     Tom   56
2 Unknown   34
3   Jerry -999
> # Replacement one column at a time
> z1$fname[z1$fname=="Unknown"] <- NA
> z1$age[z1$age==-999] <- NA
> z1
  fname age
1   Tom  56
2  <NA>  34
3 Jerry  NA
> # Global replacement
> z2[z2=="Unknown" | z2==-999] <- NA
> z2
  fname age
1   Tom  56
2  <NA>  34
3 Jerry  NA
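
When several fields each have their own missing-value codes, replacement one field at a time can be wrapped in a small helper. The sketch below is only one way to do this; the function name recode_na and its codes argument are hypothetical and not part of R.

```r
# Hypothetical helper: recode field-specific missing-value codes to NA
recode_na <- function(data, codes) {
  # codes: a named list with one vector of missing-value codes per field
  for (field in names(codes)) {
    # %in% returns FALSE (never NA) for existing NAs, so indexing is safe
    is.code <- data[[field]] %in% codes[[field]]
    data[[field]][is.code] <- NA
  }
  data
}

z <- data.frame(fname = c("Tom", "Unknown", "Jerry"),
                age = c(56, 34, -999),
                stringsAsFactors = FALSE)
z.clean <- recode_na(z, list(fname = "Unknown", age = -999))
z.clean
```

Listing the codes per field avoids the risk, present in global replacement, of wiping out a value that is a valid observation in another field.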


3.8.2 Importing missing values with the read.table function


When importing ASCII data files using the read.table function, use the na.strings
option to specify what characters are to be converted to NA. The default setting is
na.strings="NA". Blank fields are also considered to be missing values in logical,
integer, numeric, and complex fields. For example, suppose the data set contains
999, 888, and . to represent missing values, then import the data like this:


mydat <- read.table("dataset.txt", na.strings = c(999, 888, "."))


If a number, say 999, represents a missing value in one field but a valid value in
another field, then import the data using the as.is=TRUE option. Then replace
the missing values in the data frame one field at a time, and convert categorical fields
to factors.
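
Both situations can be tried without a data file by passing inline text to read.table through its text option. The data below are made up for illustration: in the first import 999, 888, and . are all treated as missing, whereas in the second import only . is missing globally and 999 is recoded in the age field alone.

```r
# Made-up inline data standing in for a file such as dataset.txt
raw <- "id age sbp
1 34 120
2 999 888
3 . 135"

# Case 1: the codes mean "missing" in every field
mydat <- read.table(text = raw, header = TRUE,
                    na.strings = c("999", "888", "."))
colSums(is.na(mydat))   # number of NAs in each field

# Case 2: 999 is missing for age only; 888 is a valid sbp value here
mydat2 <- read.table(text = raw, header = TRUE,
                     na.strings = ".", as.is = TRUE)
mydat2$age[mydat2$age %in% 999] <- NA
mydat2
```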


3.8.3 Working with NA values in data frames and factors 


There are several functions for working with NA values in data frames. First, the
na.fail function tests whether a data frame contains any NA values, returning an
error message if it contains NAs.








> name <- c("Jose", "Ana", "Roberto", "Isabel", "Jen")
> gender <- c("M", "F", "M", NA, "F")
> age <- c(34, NA, 22, 18, 34)
> df <- data.frame(name, gender, age)
> df
     name gender age
1    Jose      M  34
2     Ana      F  NA
3 Roberto      M  22
4  Isabel   <NA>  18
5     Jen      F  34
> na.fail(df) # NAs in data frame
Error in na.fail.default(df) : missing values in object
> na.fail(df[c(1, 3, 5),]) # no NAs in data frame
     name gender age
1    Jose      M  34
3 Roberto      M  22
5     Jen      F  34


Both na.omit and na.exclude remove observations (rows) that contain NAs in any
field. na.exclude differs from na.omit only in the class of the “na.action”
attribute of the result, which is “exclude” (see help for details).


> na.omit(df)
     name gender age
1    Jose      M  34
3 Roberto      M  22
5     Jen      F  34
> na.exclude(df)
     name gender age
1    Jose      M  34
3 Roberto      M  22
5     Jen      F  34


The complete.cases function returns a logical vector for observations that are
“complete” (i.e., do not contain NAs).


> complete.cases(df)
[1]  TRUE FALSE  TRUE FALSE  TRUE
> df[complete.cases(df),] # equivalent to na.omit
     name gender age
1    Jose      M  34
3 Roberto      M  22
5     Jen      F  34


3.8.3.1 NA values in factors 


By default, factor levels do not include NA. To include NA as a factor level, use the 
factor function, setting the exclude option to NULL. Including NA as a factor 
level enables counting and displaying the number of NAs in tables, and analyzing 
NA values in statistical models. 


> df$gender
[1] M    F    M    <NA> F
Levels: F M
> xtabs(~gender, data = df)
gender
F M
2 2
> df$gender.na <- factor(df$gender, exclude = NULL)
> xtabs(~gender.na, data = df)
gender.na
   F    M <NA>
   2    2    1


3.8.3.2 Indexing data frames that contain NAs 


Using the original data frame (that can contain NAs), we can index subjects with ages
less than 25.


> df$age # age field
[1] 34 NA 22 18 34
> df[df$age<25, ] # index ages < 25
      name gender age
NA    <NA>   <NA>  NA
3  Roberto      M  22
4   Isabel   <NA>  18


The row that corresponds to the age that is missing (NA) has been converted to
NAs (“not available”) by R. To remove this uninformative row we use the is.na
function.


> df[df$age<25 & !is.na(df$age), ]
     name gender age
3 Roberto      M  22
4  Isabel   <NA>  18


This differs from the na.omit, na.exclude, and complete.cases func- 
tions that remove all missing values from the data frame first. 
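
Two other base R idioms give the same result without writing the is.na condition ourselves. This is a sketch rather than a recommendation; it rebuilds the example data frame so that it can be run on its own, and the names young1 and young2 are illustrative.

```r
df <- data.frame(name = c("Jose", "Ana", "Roberto", "Isabel", "Jen"),
                 gender = c("M", "F", "M", NA, "F"),
                 age = c(34, NA, 22, 18, 34),
                 stringsAsFactors = FALSE)

# which() returns only the positions that are TRUE, so NA comparisons drop out
young1 <- df[which(df$age < 25), ]

# subset() treats an NA in the condition as FALSE
young2 <- subset(df, age < 25)

young1
young2
```

In both cases Isabel is retained even though her gender is missing, because only the age condition is evaluated.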


3.8.4 Viewing number of missing values in tables 


By default, NAs are not tabulated in tables produced by the table and xtabs
functions. The table function can tabulate character vectors and factors. The
xtabs function only works with fields in a data frame. To tabulate NAs in character
vectors using the table function, set the exclude option to NULL in the
table function.


> df$gender.chr <- as.character(df$gender)
> df$gender.chr
[1] "M" "F" "M" NA  "F"
> table(df$gender.chr)

F M
2 2
> table(df$gender.chr, exclude = NULL)

   F    M <NA>
   2    2    1


However, this will not work with factors: we must change the factor levels first. 


> table(df$gender) #does not tabulate NAs

F M
2 2
> table(df$gender, exclude = NULL) #does not work

F M
2 2
> df$gender.na <- factor(df$gender, exclude = NULL) #works
> table(df$gender.na)

   F    M <NA>
   2    2    1


Finally, whereas the exclude option works on character vectors tabulated with the
table function, it does not work on character vectors or factors tabulated with the
xtabs function. In a data frame, we must convert the character vector to a factor
(setting the exclude option to NULL); then the xtabs function tabulates the
NA values.


> xtabs(~gender, data=df, exclude=NULL) # does not work
gender
F M
2 2
> xtabs(~gender.chr, data=df, exclude=NULL) # still does not work
gender.chr
F M
2 2
> df$gender.na <- factor(df$gender, exclude = NULL) #works
> xtabs(~gender.na, data = df)
gender.na
   F    M <NA>
   2    2    1
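
Base R also provides the useNA option of the table function and the addNA function, which may be more convenient than rebuilding the factor by hand. The sketch below assumes a reasonably recent version of R, and the behavior of the exclude option may differ from the output shown above depending on the version installed.

```r
gender <- factor(c("M", "F", "M", NA, "F"))

# useNA = "ifany" adds an <NA> cell only when NAs are present
tab1 <- table(gender, useNA = "ifany")

# addNA() returns a factor in which NA is an explicit level
tab2 <- table(addNA(gender))

# The same idea inside a formula for xtabs
df <- data.frame(gender)
tab3 <- xtabs(~ addNA(gender), data = df)

tab1
tab3
```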


3.8.5 Setting default NA behaviors in statistical models 


Statistical models, for example the glm function for generalized linear models, have
default NA behaviors that can be reset locally using the na.action option in
the glm function, or reset globally using the na.action option setting in the
options function.


> options("na.action") # display global setting
$na.action
[1] "na.omit"

> options(na.action="na.fail") # reset global setting
> options("na.action")
$na.action
[1] "na.fail"


By default, na.action is set to “na.omit” in the options function. Globally (inside
the options function) or locally (inside a statistical function), na.action
can be set to the following:


• “na.fail”
• “na.omit”
• “na.exclude”
• “na.pass”

With “na.fail”, a function that calls na.action will return an error if the data object
contains NAs. Both “na.omit” and “na.exclude” will remove row observations
from a data frame that contain NAs.[12] With “na.pass”, the data object is returned
unchanged.
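
The practical difference between “na.omit” and “na.exclude” shows up when we extract residuals or fitted values from a model. Here is a small sketch using the lm function and made-up data (the data frame dat and the object names are illustrative).

```r
# Made-up data: the third outcome value is missing
dat <- data.frame(x = 1:6, y = c(2.1, 3.9, NA, 8.2, 9.8, 12.1))

fit.omit <- lm(y ~ x, data = dat, na.action = na.omit)
fit.excl <- lm(y ~ x, data = dat, na.action = na.exclude)

# na.omit: residuals only for the rows used in the fit
length(residuals(fit.omit))
# na.exclude: residuals padded with NA to match the original rows
length(residuals(fit.excl))
residuals(fit.excl)

# na.fail: the model refuses to run when NAs are present
fit.fail <- try(lm(y ~ x, data = dat, na.action = na.fail), silent = TRUE)
inherits(fit.fail, "try-error")
```

Padded residuals are convenient when we want to attach them back to the original data frame, because the row positions still line up.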


3.8.6 Working with finite, infinite, and NaN numbers 


In R, some numerical operations result in negative infinity, positive infinity, or an
indeterminate value (NaN for “not a number”). To assess whether values are finite
or infinite, use the is.finite or is.infinite functions, respectively. To assess
whether a value is NaN, use the is.nan function. While is.na can identify
NaNs, is.nan cannot identify NAs.


> x <- c(-2:2)/c(2, 0, 0, 0, 2)
> x
[1]   -1 -Inf  NaN  Inf    1
> is.infinite(x)
[1] FALSE  TRUE FALSE  TRUE FALSE
> x[is.infinite(x)]
[1] -Inf  Inf
> is.finite(x)
[1]  TRUE FALSE FALSE FALSE  TRUE
> x[is.finite(x)]
[1] -1  1
> is.nan(x)
[1] FALSE FALSE  TRUE FALSE FALSE
> x[is.nan(x)]
[1] NaN
> is.na(x) # does index NaN
[1] FALSE FALSE  TRUE FALSE FALSE
> x[is.na(x)]
[1] NaN
> x[is.nan(x)] <- NA
> x
[1]   -1 -Inf   NA  Inf    1
> is.nan(x) # does not index NA
[1] FALSE FALSE FALSE FALSE FALSE


[12] If na.omit removes cases, the row numbers of the cases form the “na.action” attribute of the
result, of class “omit”. na.exclude differs from na.omit only in the class of the “na.action”
attribute of the result, which is “exclude”. See help for more details.


[Figure 3.4 (flow diagram): recorded calendar dates (e.g., “11/02/1959”) are converted by
as.Date into Date class objects (1959-11-02); recorded calendar dates and times (e.g.,
“11/02/1959 11:14:24”) are converted by strptime, as.POSIXlt, and as.POSIXct into
date-time class objects (1959-11-02 11:14:24).]

Fig. 3.4 Displayed are functions to convert calendar date and time data into R date-time classes
(as.Date, strptime, as.POSIXlt, as.POSIXct), and the format function converts
date-time objects into character dates, days, weeks, months, times, etc.


3.9 Working with dates and times 


There are 60 seconds in 1 minute, 60 minutes in 1 hour, 24 hours in 1 day, 7 days in
1 week, and 365 days in 1 year (except every 4th year we have a leap year with 366
days). Although this seems straightforward, doing numerical calculations with these
time measures is not. Fortunately, computers make this much easier. Functions to
deal with dates are available in the base, chron, and survival packages.

Summarized in Figure 3.4 is the relationship between recorded data (calendar
dates and times) and their representation in R as date-time class objects (Date,
POSIXlt, POSIXct). The as.Date function converts a calendar date into a Date
class object. The strptime function converts a calendar date and time into a date-time
class object (of class POSIXlt). The as.POSIXlt and as.POSIXct functions
convert date-time class objects into POSIXlt and POSIXct, respectively.

The format function converts date-time objects into human legible character
data such as dates, days, weeks, months, times, etc. These functions are discussed
in more detail in the paragraphs that follow.


3.9.1 Date functions in the base package


3.9.1.1 The as.Date function


Let’s start with simple date calculations. The as.Date function in R converts calendar
dates (e.g., 11/2/1949) into a Date object, a numeric vector of class Date.
The numeric information is the number of days since January 1, 1970 (also called
Julian dates). However, because calendar date data can come in a variety of formats,
we need to specify the format so that as.Date does the correct conversion. Study
the following analysis carefully.


> bdays <- c("11/2/1959", "1/1/1970")
> bdays
[1] "11/2/1959" "1/1/1970"
> #convert to Julian dates
> bdays.julian <- as.Date(bdays, format = "%m/%d/%Y")
> bdays.julian
[1] "1959-11-02" "1970-01-01"





Although this looks like a character vector, it is not: it is class “Date” and mode
“numeric”.


> #display Julian dates
> as.numeric(bdays.julian)
[1] -3713     0
> #calculate age as of today's date
> date.today <- Sys.Date()
> date.today
[1] "2005-09-25"
> age <- (date.today - bdays.julian)/365.25
> age
Time differences of 45.89733, 35.73169 days
> #the display of 'days' is not correct
> #truncate number to get "age"
> age2 <- trunc(as.numeric(age))
> age2
[1] 45 35
> #create data frame
> bd <- data.frame(Birthday = bdays, Standard = bdays.julian,
+     Julian = as.numeric(bdays.julian), Age = age2)
> bd
   Birthday   Standard Julian Age
1 11/2/1959 1959-11-02  -3713  45
2  1/1/1970 1970-01-01      0  35
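
Dividing by 365.25 is an approximation that can be off by one near a birthday. As a sketch of an alternative, the hypothetical helper below (age_years is not a base R function) counts completed years by comparing the month and day components directly.

```r
# Hypothetical helper: completed years of age on a given date
age_years <- function(birth, on) {
  b <- as.POSIXlt(birth)
  d <- as.POSIXlt(on)
  age <- d$year - b$year
  # Subtract 1 when the birthday has not yet occurred in the year of 'on'
  before <- (d$mon < b$mon) | (d$mon == b$mon & d$mday < b$mday)
  age - as.integer(before)
}

bdays <- as.Date(c("11/2/1959", "1/1/1970"), format = "%m/%d/%Y")
age_years(bdays, as.Date("2005-09-25"))
```

For the two birthdays above and the reference date 2005-09-25, this agrees with the truncated ages of 45 and 35.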


To summarize, as.Date converted the character vector of calendar dates into
Julian dates (days since 1970-01-01) that are displayed in a standard format
(yyyy-mm-dd). The Julian dates can be used in numerical calculations. To see the Julian dates
use the as.numeric or julian function. Because the calendar dates to be converted
can come in a diversity of formats (e.g., November 2, 1959; 11-02-59; 11-02-1959;
02Nov59), one must specify the format option in as.Date. Below are selected
format options; for a complete list see help(strptime).


"%a"   Abbreviated weekday name.
"%A"   Full weekday name.
"%b"   Abbreviated month name.
"%B"   Full month name.
"%d"   Day of the month as decimal number (01-31).
"%j"   Day of year as decimal number (001-366).
"%m"   Month as decimal number (01-12).
"%U"   Week of the year as decimal number (00-53) using the
       first Sunday as day 1 of week 1.
"%w"   Weekday as decimal number (0-6, Sunday is 0).
"%W"   Week of the year as decimal number (00-53) using the
       first Monday as day 1 of week 1.
"%y"   Year without century (00-99). If you use this on input,
       which century you get is system-specific. So don’t!
       Often values up to 69 (or 68) are prefixed by 20 and
       70-99 by 19.
"%Y"   Year with century.

Here are some examples of converting dates with different formats: 





> as.Date("November 2, 1959", format = "%B %d, %Y")
[1] "1959-11-02"
> as.Date("11/2/1959", format = "%m/%d/%Y")
[1] "1959-11-02"
> #caution using 2-digit year
> as.Date("11/2/59", format = "%m/%d/%y")
[1] "2059-11-02"
> as.Date("02Nov1959", format = "%d%b%Y")
[1] "1959-11-02"
> #caution using 2-digit year
> as.Date("02Nov59", format = "%d%b%y")
[1] "2059-11-02"
> #standard format does not require format option
> as.Date("1959-11-02")
[1] "1959-11-02"
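
When 2-digit years cannot be avoided, one possible repair is to move any date that falls after a plausible cutoff back by 100 years. The helper below is a hypothetical sketch (fix_century is not part of R); it assumes that no true date lies after the cutoff, for example birth dates cannot be in the future, and that the 2-digit year 59 is read as 2059 as in the output above.

```r
# Hypothetical helper: shift dates that fall after 'cutoff' back one century
fix_century <- function(d, cutoff = Sys.Date()) {
  lt <- as.POSIXlt(d)
  future <- !is.na(d) & d > cutoff
  lt$year[future] <- lt$year[future] - 100   # 'year' is years since 1900
  as.Date(lt)
}

d <- as.Date(c("11/2/59", "1/15/04"), format = "%m/%d/%y")
fixed <- fix_century(d, cutoff = as.Date("2005-09-25"))
format(fixed)
```

A fixed cutoff is used here so that the result does not depend on the day the code is run.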














Notice how Julian dates can be used like any integer:


> as.Date("2004-01-15"):as.Date("2004-01-23")
[1] 12432 12433 12434 12435 12436 12437 12438 12439 12440
> seq(as.Date("2004-01-15"), as.Date("2004-01-18"), by = 1)
[1] "2004-01-15" "2004-01-16" "2004-01-17" "2004-01-18"


3.9.1.2 The weekdays, months, quarters, julian functions


Use the weekdays, months, quarters, or julian functions to extract infor- 
mation from Date and other date-time objects in R. 


> mydates <- c("2004-01-15", "2004-04-15", "2004-10-15") 
> mydates <- as.Date(mydates) 

> weekdays (mydates) 

[1] "Thursday" "Thursday" "Friday" 
> months (mydates) 

[1] "January" "April" "October" 
> quarters (mydates) 

[1] "Q1" "Q2" "Q4"

> julian (mydates) 

[1] 12432 12523 12706 
attr(,"origin") 

[1] "1970-01-01" 


3.9.1.3 The strptime function 


So far we have worked with calendar dates; however, we also need to be able to
work with times of the day. Whereas as.Date only works with calendar dates, the
strptime function will accept data in the form of calendar dates and times of the
day (HH:MM:SS, where H = hour, M = minutes, S = seconds). For example, let’s
look at the Oswego foodborne illness outbreak that occurred in 1940. The source of the
outbreak was attributed to the church supper that was served on April 18, 1940. The
food was available for consumption from 6 pm to 11 pm. The onset of symptoms
occurred on April 18th and 19th. The meal consumption times and the illness onset
times were recorded.


> odat <- read.table("http://www.medepi.net/data/oswego.txt",
+     sep = "", header = TRUE, as.is = TRUE)
> str(odat)
'data.frame':   75 obs. of  21 variables:
 $ id                 : int  2 3 4 6 7 8 9 10 14 16 ...
 $ age                : int  52 65 59 63 70 40 15 33 10 32 ...
 $ sex                : chr  "F" "M" "F" "F" ...
 $ meal.time          : chr  "8:00 PM" "6:30 PM" "6:30 PM" "7:30 PM" ...
 $ ill                : chr  "Y" "Y" "Y" "Y" ...
 $ onset.date         : chr  "4/19" "4/19" "4/19" "4/18" ...
 $ onset.time         : chr  "12:30 AM" "12:30 AM" "12:30 AM" "10:30 PM" ...
 $ vanilla.ice.cream  : chr  "Y" "Y" "Y" "Y" ...
 $ chocolate.ice.cream: chr  "N" "Y" "Y" "N" ...
 $ fruit.salad        : chr  "N" "N" "N" "N" ...


To calculate the incubation period, for ill individuals, we need to subtract the
meal consumption times (occurring on 4/18) from the illness onset times (occurring
on 4/18 and 4/19). Therefore, we need two date-time objects to do this arithmetic.
First, let’s create a date-time object for the meal times:


> # look at existing data for meals
> odat$meal.time[1:5]
[1] "8:00 PM" "6:30 PM" "6:30 PM" "7:30 PM" "7:30 PM"
> # create character vector with meal date and time
> mdt <- paste("4/18/1940", odat$meal.time)
> mdt[1:4]
[1] "4/18/1940 8:00 PM" "4/18/1940 6:30 PM"
[3] "4/18/1940 6:30 PM" "4/18/1940 7:30 PM"
> # convert into standard date and time
> meal.dt <- strptime(mdt, format = "%m/%d/%Y %I:%M %p")
> meal.dt[1:4]
[1] "1940-04-18 20:00:00" "1940-04-18 18:30:00"
[3] "1940-04-18 18:30:00" "1940-04-18 19:30:00"
> # look at existing data for illness onset
> odat$onset.date[1:4]
[1] "4/19" "4/19" "4/19" "4/18"
> odat$onset.time[1:4]
[1] "12:30 AM" "12:30 AM" "12:30 AM" "10:30 PM"
> # create vector with onset date and time
> odt <- paste(paste(odat$onset.date, "/1940", sep=""),
+     odat$onset.time)
> odt[1:4]
[1] "4/19/1940 12:30 AM" "4/19/1940 12:30 AM"
[3] "4/19/1940 12:30 AM" "4/18/1940 10:30 PM"
> # convert into standard date and time
> onset.dt <- strptime(odt, "%m/%d/%Y %I:%M %p")
> onset.dt[1:4]
[1] "1940-04-19 00:30:00" "1940-04-19 00:30:00"
[3] "1940-04-19 00:30:00" "1940-04-18 22:30:00"
> # calculate incubation period
> incub.period <- onset.dt - meal.dt
> incub.period
Time differences of 4.5, 6.0, 6.0, 3.0, 3.0, 6.5, 3.0, 4.0,
  6.5, NA, NA, NA, NA, 3.0, ...
  NA, NA, NA, NA, NA, NA, NA hours
> mean(incub.period, na.rm = T)
Time difference of 4.295455 hours
> median(incub.period, na.rm = T)
Error in Summary.difftime(..., na.rm = na.rm) :
    sum not defined for "difftime" objects
> # try 'as.numeric' on 'incub.period'
> median(as.numeric(incub.period), na.rm = T)
[1] 4
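
When subtracting date-time objects, R chooses the units (seconds, minutes, hours, days) automatically, which can be surprising. The sketch below uses the difftime function to fix the units explicitly, with the first four meal and onset times shown above typed in by hand. Setting tz = "UTC" is an assumption made here to avoid daylight saving adjustments, and parsing AM/PM assumes an English locale.

```r
fmt <- "%m/%d/%Y %I:%M %p"
meal <- strptime(c("4/18/1940 8:00 PM", "4/18/1940 6:30 PM",
                   "4/18/1940 6:30 PM", "4/18/1940 7:30 PM"),
                 format = fmt, tz = "UTC")
onset <- strptime(c("4/19/1940 12:30 AM", "4/19/1940 12:30 AM",
                    "4/19/1940 12:30 AM", "4/18/1940 10:30 PM"),
                  format = fmt, tz = "UTC")

# Request hours explicitly rather than relying on automatic units
incub <- difftime(onset, meal, units = "hours")

# Convert to a plain numeric vector before summarizing
incub.hours <- as.numeric(incub, units = "hours")
incub.hours
median(incub.hours)
```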


To summarize, we used strptime to convert the meal consumption date and
times and illness onset dates and times into date-time objects (meal.dt and
onset.dt) that can be used to calculate the incubation periods by simple subtraction
(and assigned name incub.period).

Notice that incub.period is an atomic object of class difftime:


> str(incub.period)
Class 'difftime'  atomic [1:75] 4.5 6 6 3 3 6.5 3 4 NA
  ..- attr(*, "tzone")= chr ""
  ..- attr(*, "units")= chr "hours"


This is why we had trouble calculating the median (which should not be the case).
We got around this problem by coercion using as.numeric:


> as.numeric(incub.period) 
[1] 4.5 6.0 6.0 3.0 3.0 6.5 3.0 4.0 6.5 NA NA NA NA 3.0 


Now, what kind of objects were created by the strptime function?


> str(meal.dt)
 'POSIXlt', format: chr [1:75] "1940-04-18 18:30:00"
> str(onset.dt)
 'POSIXlt', format: chr [1:75] "1940-04-19 00:30:00"


The strptime function produces a named list of class POSIXlt. POSIX
stands for “Portable Operating System Interface,” and “lt” stands for “local
time”.[13]


3.9.1.4 The POSIXlt and POSIXct functions


The POSIXlt list contains the date-time data in human readable forms. The named
list contains the following vectors:


'sec'    0-61: seconds
'min'    0-59: minutes
'hour'   0-23: hours
'mday'   1-31: day of the month
'mon'    0-11: months after the first of the year.
'year'   Years since 1900.
'wday'   0-6 day of the week, starting on Sunday.
'yday'   0-365: day of the year.
'isdst'  Daylight savings time flag. Positive if in force,
         zero if not, negative if unknown.


[13] For more information visit the Portable Application Standards Committee site at
http://www.pasc.org/


Let’s examine the onset.dt object we created from the Oswego data.


> is.list(onset.dt)
[1] TRUE
> names(onset.dt)
[1] "sec"   "min"   "hour"  "mday"  "mon"   "year"  "wday"
[8] "yday"  "isdst"
> onset.dt$min
[1] 30 30 30 30 30 ...
> onset.dt$hour
[1]  0  0  0 22 22 ...
> onset.dt$mday
[1] 19 19 19 18 18 ...
> onset.dt$mon
[1] 3 3 3 3 3 ...
> onset.dt$year
[1] 40 40 40 40 40 ...
> onset.dt$wday
[1] 5 5 5 4 4 ...
> onset.dt$yday
[1] 109 109 109 108 108 109 109 108 109 109 109 108 108 109 ...


The POSIXlt list contains useful date-time information; however, it is not in a
convenient form for storing in a data frame. Using as.POSIXct we can convert it
to a “calendar time” object that contains the number of seconds since 1970-01-01
00:00:00. as.POSIXlt coerces a date-time object to POSIXlt.


> onset.dt.ct <- as.POSIXct(onset.dt)
> onset.dt.ct[1:5]
[1] "1940-04-19 00:30:00 Pacific Daylight Time"
[2] "1940-04-19 00:30:00 Pacific Daylight Time"
[3] "1940-04-19 00:30:00 Pacific Daylight Time"
[4] "1940-04-18 22:30:00 Pacific Daylight Time"
[5] "1940-04-18 22:30:00 Pacific Daylight Time"
> as.numeric(onset.dt.ct[1:5])
[1] -937326600 -937326600 -937326600 -937333800 -937333800


3.9.1.5 The format function


Whereas the strptime function converts a character vector of date-time information
into a date-time object, the format function converts a date-time object into a
character vector. The format function gives us great flexibility in converting
date-time objects into numerous outputs (e.g., day of the week, week of the year, day of
the year, month of the year, year). Selected date-time format options are listed in
Section 3.9.1.1; for a complete list see help(strptime).

For example, in public health, reportable communicable diseases are often re- 
ported by “disease week” (this could be week of reporting or week of symptom 
onset). This information is easily extracted from R date-time objects. For weeks 
starting on Sunday use the “%U” option in the format function, and for weeks 
starting on Monday use the “%W” option. 


> decjan <- seq(as.Date("2003-12-15"), as.Date("2004-01-15"),
+     by = 1)
> decjan
 [1] "2003-12-15" "2003-12-16" "2003-12-17" "2003-12-18"
 ...
[29] "2004-01-12" "2004-01-13" "2004-01-14" "2004-01-15"
> disease.week <- format(decjan, "%U")
> disease.week
 [1] "50" "50" "50" "50" "50" "50" "51" "51" "51" "51" "51"
[12] "51" "51" "52" "52" "52" "52" "00" "00" "00" "01" "01"
[23] "01" "01" "01" "01" "01" "02" "02" "02" "02" "02"
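
Because week numbers restart at the beginning of each year, it can help to combine the year and the week when counting reports by disease week. The following sketch treats each date in the sequence as one hypothetical case report; the object names onset, week.id, and counts are illustrative.

```r
# One made-up case report per day from 2003-12-15 to 2004-01-15
onset <- seq(as.Date("2003-12-15"), as.Date("2004-01-15"), by = 1)

# Year plus week of the year (weeks starting on Sunday)
week.id <- format(onset, "%Y-%U")

# Number of reports in each disease week
counts <- table(week.id)
counts
```

The partial weeks at the end of December and the start of January appear as separate entries (“2003-52” and “2004-00”), which is worth keeping in mind when plotting counts across a year boundary.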





3.9.2 Date functions in the chron and survival packages


The chron and survival packages have customized functions for dealing with
dates. Both packages come with the default R installation. To learn more about date
and time classes read R News, Volume 4/1, June 2004.[14]


3.10 Exporting data objects


On occasion, we need to export R data objects. This can be done in several ways,
depending on our needs:


• Generic ASCII text file
• R ASCII text file
• R binary file
• Non-R ASCII text files
• Non-R binary file


[14] http://cran.r-project.org/doc/Rnews


Table 3.5 R functions for exporting data objects

Function       Description                          Try these examples

Export to generic ASCII text file
write.table    Write tabular data as a data         write.table(infert, "infert.dat")
write.csv      frame to an ASCII text file;         write.csv(infert, "infert.csv")
               read file back in using
               read.table function
write          Write matrix elements to an          x <- matrix(1:4, 2, 2)
               ASCII text file                      write(t(x), "x.txt")

Export to R ASCII text file
dump           “Dumps” list of R objects as R       dump("Titanic", "titanic.R")
               code to an ASCII text file; read
               back in using source function
dput           Writes an R object as R code         dput(Titanic, "titanic.R")
               (but without the object name)
               to the console, or an ASCII
               text file; read file back in using
               dget function

Export to R binary file
save           “Saves” list of R objects as         save(Titanic, file = "titanic.Rdata")
               binary filename.Rdata file;
               read back in using load
               function

Export to non-R ASCII text file
write.foreign  From foreign package:                write.foreign(infert,
               writes text files (SPSS, Stata,        datafile="infert.dat",
               SAS) and code to read them             codefile="infert.txt",
                                                      package = "SPSS")

Export to non-R binary file
write.dbf      From foreign package:                write.dbf(infert,
               writes DBF files                       "infert.dbf")
write.dta      From foreign package:                write.dta(infert,
               writes files in Stata binary           "infert.dta")
               format


3.10.1 Exporting to a generic ASCII text file


3.10.1.1 The write.table function 


We use the write.table function to export a data frame as a tabular ASCII text
file which can be read by most statistical packages. If the object is not a data frame,
it is converted to a data frame before being exported. Therefore, this function only makes
sense with tabular data objects. Here are the default arguments:


> args(write.table)
function (x, file = "", append = FALSE, quote = TRUE,
    sep = " ", eol = "\n", na = "NA", dec = ".",
    row.names = TRUE, col.names = TRUE,
    qmethod = c("escape", "double"))


The 1st argument will be the data frame name (e.g., infert), the 2nd will be the name
for the output file (e.g., infert.dat), the sep argument is set to be space-delimited,
and the row.names argument is set to TRUE.
The following code:





write.table(infert, "infert.dat") 


produces this ASCII text file:


"education" "age" "parity" "induced" "case" "spontaneous" "stratum" "pooled.stratum"
"1" "0-5yrs" 26 6 1 1 2 1 3
"2" "0-5yrs" 42 1 1 1 0 2 1
"3" "0-5yrs" 39 6 2 1 0 3 4
"4" "0-5yrs" 34 4 2 1 0 4 2
"5" "6-11yrs" 35 3 1 1 1 5 32


Because row.names=TRUE, the number of field names in the header (row 1) will be
one less than the number of columns (starting with row 2). The default row names are
a character vector of integers. The following code:








write.table(infert, "infert.dat", sep=",", row.names=FALSE)

produces a comma-delimited ASCII text file without row names:

"education","age","parity","induced","case","spontaneous","stratum","pooled.stratum"
"0-5yrs",26,6,1,1,2,1,3
"0-5yrs",42,1,1,1,0,2,1
"0-5yrs",39,6,2,1,0,3,4
"0-5yrs",34,4,2,1,0,4,2
"6-11yrs",35,3,1,1,1,5,32

Note that the write.csv function produces a comma-delimited data file by default.
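As a minimal sketch of the relationship between the two functions, the following writes the same data both ways and reads the result back. The small data frame df and the temporary file names are invented for illustration. Note that write.csv, like write.table, still writes row names unless row.names=FALSE is given.

```r
# Hypothetical data frame, used only for illustration
df <- data.frame(id = 1:3, grp = c("a", "b", "a"), stringsAsFactors = FALSE)

f1 <- tempfile(fileext = ".dat")   # temporary output files
f2 <- tempfile(fileext = ".csv")

# Comma-delimited output from write.table: sep must be set explicitly
write.table(df, f1, sep = ",", row.names = FALSE)

# write.csv sets sep = "," for us
write.csv(df, f2, row.names = FALSE)

# Compare the two files line by line, then read one back in
same.text <- identical(readLines(f1), readLines(f2))
df.back <- read.csv(f2, stringsAsFactors = FALSE)
```

For this small example the two files should be the same, so the choice between the functions is mostly a matter of convenience.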

3.10.1.2 The write function 

The write function writes the contents of a matrix in columnwise order to an
ASCII text file. To get the same appearance as the matrix, we must transpose the
matrix and specify the number of columns (the default is 5). If we set file="",
then the output is written to the screen:

> infert.tab1 <- xtabs(~case+parity, data=infert)
> infert.tab1
    parity
case  1  2  3  4  5  6
   0 66 54 24 12  4  5
   1 33 27 12  6  2  3
> write(infert.tab1, file="") #not what we want
66 33 54 27 24
12 12 6 4 2
5 3
> #much better
> write(t(infert.tab1), file="", ncol=ncol(infert.tab1))
66 54 24 12 4 5
33 27 12 6 2 3


To read the raw data back into R, we would use the scan function. For example,
if the data had been written to data.txt, then the following code reads the data
back into R:

> matrix(scan("data.txt"), ncol=6, byrow=TRUE)
Read 12 items
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   66   54   24   12    4    5
[2,]   33   27   12    6    2    3

Of course, all the labeling was lost. 
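If the labels are known, they can be put back by hand after scanning. The following is a minimal sketch, assuming the same infert.tab1 table and writing to a temporary file rather than data.txt; the dimension names are supplied again explicitly because the file does not store them.

```r
# Cross-tabulation from the infert data set that ships with R
infert.tab1 <- xtabs(~case + parity, data = infert)

f <- tempfile(fileext = ".txt")
# Transpose so the file rows correspond to the rows of the printed table
write(t(infert.tab1), file = f, ncolumns = ncol(infert.tab1))

# scan returns a plain numeric vector; rebuild the matrix row by row
m <- matrix(scan(f, quiet = TRUE), ncol = 6, byrow = TRUE)

# The labels are not in the file, so they are re-entered here
dimnames(m) <- list(case = c("0", "1"), parity = as.character(1:6))
m
```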


3.10.2 Exporting to R ASCII text file 


Data objects can also be exported as R code in an ASCII text file using the dump 
and dput functions. This has advantages for complex R objects (e.g., arrays, lists) 
that do not have simple tabular structures, and the R code makes the raw data human 
legible. 


3.10.2.1 The dump function 
The dump function exports multiple objects as R code, as the next example illustrates:

> infert.tab1 <- xtabs(~case+parity, data=infert)
> infert.tab2 <- xtabs(~education+parity+case, data=infert)
> infert.tab1 #display matrix
    parity
case  1  2  3  4  5  6
   0 66 54 24 12  4  5
   1 33 27 12  6  2  3
> infert.tab2 #display array
, , case = 0

         parity
education  1  2  3  4  5  6
  0-5yrs   2  0  0  2  0  4
  6-11yrs 28 28 14  8  2  0
  12+ yrs 36 26 10  2  2  1

, , case = 1

         parity
education  1  2  3  4  5  6
  0-5yrs   1  0  0  1  0  2
  6-11yrs 14 14  7  4  1  0
  12+ yrs 18 13  5  1  1  1

> dump(c("infert.tab1", "infert.tab2"),"infert_tab.R") #export 
The dump function produced the following R code in the infert_tab.R text file.

`infert.tab1` <-
structure(c(66, 33, 54, 27, 24, 12, 12, 6, 4, 2, 5, 3),
.Dim = c(2L, 6L), .Dimnames = structure(list(case = c("0", "1"),
parity = c("1", "2", "3", "4", "5", "6")), .Names = c("case",
"parity")), class = c("xtabs", "table"), call = quote(xtabs(
formula = ~case + parity, data = infert)))
`infert.tab2` <-
structure(c(2, 28, 36, 0, 28, 26, 0, 14, 10, 2, 8, 2, 0, 2,
2, 4, 0, 1, 1, 14, 18, 0, 14, 13, 0, 7, 5, 1, 4, 1, 0, 1, 1,
2, 0, 1), .Dim = c(3L, 6L, 2L), .Dimnames = structure(list(
education = c("0-5yrs", "6-11yrs", "12+ yrs"), parity = c("1",
"2", "3", "4", "5", "6"), case = c("0", "1")), .Names =
c("education", "parity", "case")), class = c("xtabs", "table"),
call = quote(xtabs(formula = ~education + parity + case,
data = infert)))
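A file written by dump can be read back with the source function. As a minimal sketch, the code below dumps the two tables to a temporary file (instead of infert_tab.R) and sources that file into a separate environment, here called env, so that objects already in the workspace are not overwritten.

```r
infert.tab1 <- xtabs(~case + parity, data = infert)
infert.tab2 <- xtabs(~education + parity + case, data = infert)

f <- tempfile(fileext = ".R")
dump(c("infert.tab1", "infert.tab2"), file = f)

# Evaluate the dumped code in its own environment
env <- new.env()
source(f, local = env)

ls(env)                                    # names of the recreated objects
same.counts <- all(env$infert.tab2 == infert.tab2)
```

Sourcing into the workspace directly (the default) recreates the objects under their original names.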


Notice that the 1st argument was a character vector of object names. The infert_tab.R
file can be run in R using the source function to recreate all the objects in the
workspace.

3.10.2.2 The dput function

The dput function is similar to the dump function except that the object name is
not written. By default, the dput function prints to the screen:

> dput(infert.tab1)
structure(c(66, 33, 54, 27, 24, 12, 12, 6, 4, 2, 5, 3),
.Dim = c(2L, 6L), .Dimnames = structure(list(case = c("0", "1"),
parity = c("1", "2", "3", "4", "5", "6")), .Names = c("case",
"parity")), class = c("xtabs", "table"), call = quote(xtabs(
formula = ~case + parity, data = infert)))


To export to an ASCII text file, give a new file name as the second argument, similar
to dump. To read the object back into R, use the dget function:

> dput(infert.tab1, "infert_tab1.R")
> dget("infert_tab1.R")
    parity
case  1  2  3  4  5  6
   0 66 54 24 12  4  5
   1 33 27 12  6  2  3
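Because dput does not write the object name, the name is chosen when the file is read. A minimal sketch, using a temporary file and an arbitrary new name (tab.copy):

```r
infert.tab1 <- xtabs(~case + parity, data = infert)

f <- tempfile(fileext = ".R")
dput(infert.tab1, file = f)

# dget returns the object; assign it to whatever name is convenient
tab.copy <- dget(f)

same.counts <- all(tab.copy == infert.tab1)
same.labels <- identical(dimnames(tab.copy), dimnames(infert.tab1))
```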


3.10.3 Exporting to R binary file 


3.10.3.1 The save function 


The save function exports R data objects to a binary file (filename.RData), which is
the most efficient, compact method to export objects. The first argument(s) can be the
names of the objects to save, followed by the output file name; alternatively, the list
argument can be given a character vector of object names, followed by the output file
name. Here is an example of the first option:

> x <- 1:5; y <- x^3
> save(x, y, file="xy.RData")
> rm(x, y)
> ls()
character(0)
> load(file="xy.RData")
> ls()
[1] "x" "y"


Notice that we used the load function to load the binary file back into the 
workspace. 


Now here is an example of the second option using a list:

> x <- 1:5; y <- x^3
> save(list=c("x", "y"), file="xy.RData")
> rm(x, y)
> ls()
character(0)
> load(file="xy.RData")
> ls()
[1] "x" "y"


In fact, the save.image function we use to save the entire workspace is essentially
a shortcut for the following:

save(list = ls(all.names=TRUE), file = ".RData")
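Because load restores objects under their original names, it can silently overwrite objects of the same name in the workspace. One way to guard against this, sketched below with a temporary file and a scratch environment called env, is to load into a separate environment and inspect the result first.

```r
x <- 1:5; y <- x^3

f <- tempfile(fileext = ".RData")
save(x, y, file = f)

# load returns (invisibly) the names of the objects it restored
env <- new.env()
loaded <- load(f, envir = env)

loaded      # names of the restored objects
env$y       # value of y as stored in the file
```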





3.10.4 Exporting to non-R ASCII text and binary files 


The foreign package contains functions for exporting R data frames to non-R
ASCII text and binary files. The write.foreign function writes two ASCII text
files: the first file is the data file, and the second file is the code file for reading
the data file. The code file contains either SPSS, Stata, or SAS programming code.
The write.dbf function writes a data frame to a binary DBF file, which can be
read back into R using the read.dbf function. Finally, the write.dta function
writes a data frame to a binary Stata file, which can be read back into R using the
read.dta function.


3.11 Working with regular expressions 


A regular expression is a special text string for describing a search pattern which 
can be used for searching text strings, indexing data objects, and replacing object 
elements. For example, we applied Global Burden of Disease methods to evaluate 
causes of premature deaths in San Francisco [4]. Using regular expressions we were 
able to efficiently code over 14,000 death records, with over 900 ICD-10 cause of 
death codes, into 117 mutually exclusive cause of death categories. Without regular 
expressions, this local study would have been prohibitively tedious. 

A regular expression is built up by specifying one character at a time. Using
this approach, we cover the following:

• Single character: matching a single character;
• Character class: matching a single character from among a list of characters;
• Concatenation: combining single characters into a new match pattern;
• Repetition: specifying how many times a single character or match pattern might
  be repeated;
• Alternation: a regular expression may be matched from among two or more regular
  expressions; and
• Metacharacters: special characters that require special treatment.


3.11.1 Single characters 


The search pattern is built up by specifying one character at a time. For example,
the pattern "x" looks for the letter x in a text string. Next, consider a character
vector of text strings. We can use the grep function to search for a pattern in this
data vector.

> vec1 <- c("x", "xa bc", "abc", "ax bc", "ab xc", "ab cx")
> grep("x", vec1)
[1] 1 2 4 5 6
> vec1[grep("x", vec1)] #index by position
[1] "x"     "xa bc" "ax bc" "ab xc" "ab cx"


The grep function returned an integer vector indicating the positions in the data
vector that contain a match. We used this integer vector to index by position.

The caret ^ matches the empty string at the beginning of a line. Therefore, to
match this pattern at the beginning of a line we add the ^ character to the regular
expression:

> grep("^x", vec1)
[1] 1 2
> vec1[grep("^x", vec1)] #index by position
[1] "x"     "xa bc"


The $ character matches the empty string at the end of a line. Therefore, to match
this pattern at the end of a line we add the $ character to the regular expression:

> vec1[grep("x$", vec1)] #index by position
[1] "x"     "ab cx"

The ^ and $ characters are examples of metacharacters (more on these later).
To match this pattern at the beginning of a word, but not the beginning of a line,
we add a space to the regular expression:

> vec1[grep(" x", vec1)] #index by position
[1] "ab xc"

To match this pattern at the end of a word, but not the end of a line, we add a space
to the regular expression:

> vec1[grep("x ", vec1)] #index by position
[1] "ax bc"

The period "." matches any single character, including a space.

> vec1[grep(".bc", vec1)]
[1] "xa bc" "abc"   "ax bc"
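Indexing by position is only one way to retrieve the matching elements. As a brief sketch using the same vec1, base R also offers the value argument of grep and the grepl function, which returns a logical vector; all three approaches below select the elements that end in x.

```r
vec1 <- c("x", "xa bc", "abc", "ax bc", "ab xc", "ab cx")

# Index by the integer positions returned by grep
by.position <- vec1[grep("x$", vec1)]

# Ask grep to return the matching elements themselves
by.value <- grep("x$", vec1, value = TRUE)

# grepl returns TRUE/FALSE for every element of vec1
by.logical <- vec1[grepl("x$", vec1)]
```

The logical form is convenient when several conditions need to be combined with & or |.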


Applied Epidemiology Using R 14-Oct-2013 © Tomas J. Aragon (www.medepi.com) 




Table 3.6 Predefined character classes for regular expressions

Predefined    Description                                 Alternative
[[:lower:]]   Lower-case letters in the current locale    [a-z]
[[:upper:]]   Upper-case letters in the current locale    [A-Z]
[[:alpha:]]   Alphabetic characters                       [A-Za-z]
[[:digit:]]   Digits                                      [0-9]
[[:alnum:]]   Alphanumeric characters                     [A-Za-z0-9]
[[:punct:]]   Punctuation characters: ! " # $ % & ' ( )   (], ^, and - require special
              * + , - . / : ; < = > ? @ [ \ ] ^ _ `       placement in character
              { | } ~                                     classes. See p. 171)
[[:space:]]   Space characters: tab, newline,
              vertical tab, form feed, carriage return,
              and space
[[:graph:]]   Graphical characters                        [[:alnum:][:punct:]]
[[:print:]]   Printable characters                        [[:alnum:][:punct:][:space:]]
[[:xdigit:]]  Hexadecimal digits                          [0-9A-Fa-f]





3.11.2 Character class 


A character class is a list of characters enclosed by square brackets [ and ] which
matches any single character in that list. For example, "[fhr]" will match the
single character f, h, or r. This can be combined with metacharacters for more
specificity; for example, "^[fhr]" will match the single character f, h, or r at
the beginning of a line.

> vec2 <- c("fat", "bar", "rat", "elf", "mach", "hat")
> grep("^[fhr]", vec2)
[1] 1 3 6
> vec2[grep("^[fhr]", vec2)] #index by position
[1] "fat" "rat" "hat"


As already shown, the ^ character matches the empty string at the beginning of a
line. However, when ^ is the first character in a character class list, it matches any
character not in the list. For example, "^[^fhr]" will match any single character
at the beginning of a line except f, h, or r.

> vec2 <- c("fat", "bar", "rat", "elf", "mach", "hat")
> vec2[grep("^[^fhr]", vec2)] #index by position
[1] "bar"  "elf"  "mach"


Character classes can be specified as a range of possible characters. For example,
[0-9] matches a single digit with possible values from 0 to 9, [A-Z] matches a
single letter with possible values from A to Z, and [a-z] matches a single letter
from a to z. The pattern [0-9A-Za-z] matches any single alphanumeric character.

For convenience, certain character classes are predefined and their interpretation
depends on the locale.¹⁵ For example, to match a single lower case letter we place
[:lower:] inside square brackets like this: "[[:lower:]]", which is equivalent
to "[a-z]". Table 3.6 lists the predefined character classes. These are very
convenient for matching punctuation characters and multiple types of
spaces (e.g., tab, newline, carriage return).
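As a small illustration of the predefined classes, the sketch below searches a made-up vector of free-text entries (notes is hypothetical, not from any data set in this book) for punctuation and for any kind of white space. Note that the tab in the third entry counts as a space character but not as punctuation.

```r
# Hypothetical free-text entries, invented for illustration
notes <- c("fever", "fever, cough", "rash\tfever", "cough", "chills!")

# Entries containing at least one punctuation character
has.punct <- grep("[[:punct:]]", notes, value = TRUE)

# Entries containing a space, tab, newline, or other white space
has.space <- grep("[[:space:]]", notes, value = TRUE)
```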

In this final example, "^.[^a].+" will match any first character, followed by
any character except a, and followed by any character one or more times:

> vec2[grep("^.[^a].+", vec2)] #index by position
[1] "elf"

Combining single character matches is called concatenation.


3.11.3 Concatenation 


Single characters (including character classes) can be concatenated; for example,
the pattern "^[fhr]at$" will match the single, isolated words fat, hat, or rat.

> vec3 <- c("fat", "bar", "rat", "fat boy", "elf", "mach",
+ "hat")
> vec3[grep("^[fhr]at$", vec3)] #index by position
[1] "fat" "rat" "hat"

The concatenation "[ct]a[br]" will match the pattern that starts with c or t,
followed by a, and followed by b or r.

> vec4 <- c("cab", "carat", "tar", "bar", "tab", "batboy",
+ "care")
> vec4[grep("[ct]a[br]", vec4)] #index by position
[1] "cab"   "carat" "tar"   "tab"   "care"

To match single, 3-letter words use "^[ct]a[br]$".

> vec4[grep("^[ct]a[br]$", vec4)] #index by position
[1] "cab" "tar" "tab"

The period (.) is another metacharacter: it matches any single character. For
example, "f.t" matches the pattern f + any character + t.

> vec5 <- c("fate", "rat", "fit", "bat", "futbol")
> vec5[grep("f.t", vec5)] #index by position
[1] "fate"   "fit"    "futbol"


15 The locale describes aspects of the internationalization of a program. Initially most aspects of
the locale of R are set to “C” (which is the default for the C language and reflects North-American
usage).

Table 3.7 Regular expressions may be followed by a repetition quantifier

Repetition quantifier   Description
?                       Preceding pattern is optional and will be matched at most once
*                       Preceding pattern will be matched zero or more times
+                       Preceding pattern will be matched one or more times
{n}                     Preceding pattern is matched exactly n times
{n,}                    Preceding pattern is matched n or more times
{n,m}                   Preceding pattern is matched at least n times, but not more than m times





3.11.4 Repetition 


Regular expressions (so far: single characters, character classes, and concatenations)
can be qualified by whether a pattern can repeat (Table 3.7). For example, the pattern
"^[fF].+t$" matches single, isolated words that start with f or F, followed by 1 or
more of any character, and ending with t.

> vec6 <- c("fat", "fate", "feat", "bat", "Fahrenheit",
+ "foot")
> vec6[grep("^[fF].+t$", vec6)] #index by position
[1] "fat"        "feat"       "Fahrenheit" "foot"

Repetition quantifiers give us great flexibility to specify how often preceding
patterns can repeat.
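To see how the other quantifiers in Table 3.7 behave, here is a short sketch applied to a made-up vector of code-like strings (codes is hypothetical). Each pattern is anchored with ^ and $ so that the whole string must match.

```r
# Hypothetical strings, invented for illustration
codes <- c("B1", "B16", "B160", "B1600", "16")

# {2}: B followed by exactly two digits
exactly.two <- grep("^B[0-9]{2}$", codes, value = TRUE)

# {2,3}: B followed by two or three digits
two.or.three <- grep("^B[0-9]{2,3}$", codes, value = TRUE)

# ? and +: an optional B, then one or more digits
optional.b <- grep("^B?[0-9]+$", codes, value = TRUE)

# *: B followed by zero or more digits
zero.or.more <- grep("^B[0-9]*$", codes, value = TRUE)
```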


3.11.5 Alternation 


Two or more regular expressions (so far: single characters, character classes, con- 
catenations, and repetitions) may be joined by the infix operator |. The resulting 
regular expression can match the pattern of any subexpression. For example, the 
World Health Organization (WHO) Global Burden of Disease (GBD) Study used 
International Classification of Diseases, 10th Revision (ICD-10) codes (ref). The 
GBD Study ICD-10 codes for hepatitis B are the following: 


B16, B16.0, B16.1, B16.2, B16.3, B16.4, B16.5, B16.7, B16.8, B16.9, B17, B17.0, B17.2, 
B17.8, B18, B18.0, B18.1, B18.8, B18.9 


Notice that B16 and B16.0 are not the same ICD-10 code! The GBD Study methods 
were used to study causes of death in San Francisco, California (ref). Underlying 
causes of death were obtained from the State of California, Center for Health Statis- 
tics. The ICD-10 code field did not have periods so that the hepatitis B codes were 
the following. 


B16, B160, B161, B162, B163, B164, B165, B167, B168, B169, B17, B170, B172, B178,
B18, B180, B181, B188, B189

To match the pattern of ICD-10 codes representing hepatitis B, the following
regular expression was used (without spaces):

"^B16[0-9]?$|^B17[0,2,8]?$|^B18[0,1,8,9]?$"

This regular expression matches ^B16[0-9]?$ or ^B17[0,2,8]?$ or ^B18[0,1,8,9]?$.
Similar to the first and third patterns, the second regular expression, ^B17[0,2,8]?$,
matches B17, B170, B172, or B178 as isolated text strings.
To see how this works, we can match each subexpression individually and then
as an alternation:

> hepb <- c("B16", "B160", "B161", "B162", "B163", "B164",
+ "B165", "B167", "B168", "B169", "B17", "B170",
+ "B172", "B178", "B18", "B180", "B181", "B188",
+ "B189")
> grep("^B16[0-9]?$", hepb) #match 1st subexpression
[1]  1  2  3  4  5  6  7  8  9 10
> grep("^B17[0,2,8]?$", hepb) #match 2nd subexpression
[1] 11 12 13 14
> grep("^B18[0,1,8,9]?$", hepb) #match 3rd subexpression
[1] 15 16 17 18 19
> #match any subexpression
> grep("^B16[0-9]?$|^B17[0,2,8]?$|^B18[0,1,8,9]?$", hepb)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
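One detail about the character classes used above is worth a quick check: inside square brackets a comma is an ordinary character, so [0,2,8] matches 0, 2, 8, or a literal comma. For ICD-10 codes this makes no practical difference, but the sketch below (with made-up test strings) shows that the class without commas, [028], is the stricter of the two.

```r
pattern.commas <- "^B17[0,2,8]?$"   # form used in the text
pattern.plain  <- "^B17[028]?$"     # same digits, no commas

# Hypothetical test strings; "B17," is included only to expose the difference
test.codes <- c("B17", "B170", "B172", "B178", "B17,", "B171")

match.commas <- grepl(pattern.commas, test.codes)
match.plain  <- grepl(pattern.plain, test.codes)
```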


A natural use for these pattern matches is for indexing and replacement. We
illustrate this using the 2nd subexpression.

> #indexing
> hepb[grep("^B17[0,2,8]?$", hepb)]
[1] "B17"  "B170" "B172" "B178"
> #replacement
> hepb[grep("^B17[0,2,8]?$", hepb)] <- "HBV"
> hepb
 [1] "B16"  "B160" "B161" "B162" "B163" "B164" "B165" "B167"
 [9] "B168" "B169" "HBV"  "HBV"  "HBV"  "HBV"  "B18"  "B180"
[17] "B181" "B188" "B189"


Using regular expression alternations allowed us to efficiently code over 14,000
death records, with over 900 ICD-10 cause of death codes, into 117 mutually
exclusive cause of death categories for our San Francisco study. Suppose sfdat was
the data frame with San Francisco deaths for 2003-2004. Then the following code
would tabulate the deaths caused by hepatitis B:

> sfdat$hepb <- rep("No", nrow(sfdat)) #new field
> get.hepb <- grep("^B16[0-9]?$|^B17[0,2,8]?$|^B18[0,1,8,9]?$",
+ sfdat$icd10)
> sfdat$hepb[get.hepb] <- "Yes"
> table(sfdat$hepb)

   No   Yes
14125    23


Therefore, in San Francisco, during the period 2003-2004, there were 23 deaths 
caused by hepatitis B. Without regular expressions, this mortality analysis would 
have been prohibitively tedious. 
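The same coding step can be tried without access to the death records. The sketch below uses a tiny, made-up stand-in for sfdat (six invented ICD-10 codes) and the grepl function, which returns TRUE or FALSE for every record so that no placeholder vector is needed. This is an alternative phrasing of the step, not the code used in the study.

```r
# Hypothetical stand-in for the San Francisco death records
sfdat <- data.frame(icd10 = c("B16", "I251", "B181", "C349", "B171", "B169"),
                    stringsAsFactors = FALSE)

hepb.pattern <- "^B16[0-9]?$|^B17[0,2,8]?$|^B18[0,1,8,9]?$"

# Code each record as Yes or No in one step
sfdat$hepb <- ifelse(grepl(hepb.pattern, sfdat$icd10), "Yes", "No")
table(sfdat$hepb)
```

Note that B171 is coded "No" because it is not among the hepatitis B codes listed for the GBD Study.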

In this next example we use regular expressions to correct misspelled data.
Suppose we have a data vector containing my first name ("Tomas"), but sometimes
misspelled. We want to locate the most common misspellings and correct them:

> tdat <- c("Tom", "Thomas", "Tomas", "Tommy", "tomas")
> misspelled <- grep("^[Tt]omm?y?$|^[Tt]homas$|^tomas$", tdat)
> misspelled
[1] 1 2 4 5
> tdat[misspelled] <- "Tomas"
> tdat
[1] "Tomas" "Tomas" "Tomas" "Tomas" "Tomas"


3.11.6 Repetition > Concatenation > Alternation 


Repetition takes precedence over concatenation, which in turn takes precedence
over alternation. A whole subexpression may be enclosed in parentheses to override
these precedence rules. For example, consider how the following regular expression
changes when parentheses are used to give concatenation precedence over repetition:

> vec7 <- c("Tommy", "Tomas", "Tomtom")
> #repetition takes precedence
> vec7[grep("[Tt]om{2,}", vec7)]
[1] "Tommy"
> #concatenation takes precedence
> vec7[grep("([Tt]om){2,}", vec7)]
[1] "Tomtom"





Recall that {2,} means repeat the preceding pattern 2 or more times.
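The low precedence of alternation matters most when anchors are involved. In the sketch below (the name vector is made up), the first pattern is read as "starts with Tomas" or "ends with Luis", whereas the parentheses in the second pattern apply both anchors to each alternative.

```r
# Hypothetical names, invented for illustration
vec <- c("Tomas", "Luis", "Tomasito", "San Luis")

# Parsed as (^Tomas) or (Luis$)
loose <- grep("^Tomas|Luis$", vec, value = TRUE)

# The whole string must be exactly Tomas or Luis
strict <- grep("^(Tomas|Luis)$", vec, value = TRUE)
```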


3.11.7 Metacharacters 


Any character, or combination of characters, can be used to specify a pattern except
for these metacharacters:

. \ | ( ) [ { ^ $ * + ?


Metacharacters have special meaning in regular expressions; they have already
been presented and are summarized in Table 3.8.

Table 3.8 Metacharacters used by regular expressions

Char.  Description                                    Example          Literal search
^      Matches empty string at beginning of line      "^my car"        See p. 171
       When 1st character in character class,         "[^abc]"         See p. 171
       matches any single character not in the list
$      Matches empty string at end of line            "my car$"        "[$]"
[      Character class                                                 "\\["
.      Matches any single character                   "p.t"            "[.]"
?      Repetition quantifier (Table 3.7)              ".?"             "[?]"
*      Repetition quantifier (Table 3.7)              ".*"             "[*]"
+      Repetition quantifier (Table 3.7)              ".+"             "[+]"
( )    Grouping subexpressions                        "([Tt]om){2,}"   "[(]" or "[)]"
|      Join subexpressions, any of which can be       "Tomas|Luis"     "[|]"
       matched
{      Begins a repetition quantifier such as         "[Tt]om{2,}"     "[{]"
       {n,m} (Table 3.7)

However, inside a character class, metacharacters have their literal interpretation.
For example, to search for data vector elements that contain one or more periods use:

> vec8 <- c("oswego.dat", "oswego", "infert.dta", "infert")
> vec8[grep("[.]", vec8)]
[1] "oswego.dat" "infert.dta"


If we want to include the following characters inside a character class, they
require special placement:

] - ^

To include a literal ], place it first in the list. Similarly, to include a literal -, place
it first or last in the list.¹⁶ Finally, to include a literal ^, place it anywhere but first.

> ages <- c("<1^1", "[1,15)", "15-34", "[35,65)", "[65-110]")
> ages[grep("[]^-]", ages)]
[1] "<1^1"     "15-34"    "[65-110]"


To search for a literal ^ as a single character takes some care. One approach is to
place it inside a character class, where it must be preceded by another character
("^" will not work, and "[^]" returns an error). Because we are only interested in
finding ^, the first character in the list should be any character we expect not to
find in the data vector ("[/^]" should work). Study the example that follows:

> vec9 <- c("8^2", "B89", "y^x", "time")
> grep("/", vec9) #test that / is not in data
integer(0)
> vec9[grep("[/^]", vec9)]
[1] "8^2" "y^x"


16 Although the - sign is not a metacharacter, it does have special meaning inside a character class
because it is used to specify a range of characters; e.g., [A-Za-z0-9].

Table 3.9 Commonly used functions that use regular expressions

Function   Description
grep       Searches for pattern matches within a character vector; returns integer vector
           indicating vector positions containing pattern
regexpr    Similar to grep but returns integer vectors with detailed information for the first
           occurrence of a pattern match within text string elements of a character vector
gregexpr   Similar to regexpr but returns a list with detailed information for the multiple
           occurrences of a pattern match within text string elements of a character vector
sub        Searches and replaces the first occurrence of a pattern match within text string
           elements of a character vector
gsub       Searches and replaces multiple occurrences of a pattern match within text string
           elements of a character vector





The first character in the list (/) was selected because there was no match in the data 
vector. 
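Two other ways of searching for a literal caret avoid the need for a placeholder character: escaping the caret with a backslash (written "\\^" in an R string), or asking grep to treat the pattern as plain text with fixed=TRUE. The sketch below applies all three approaches to a small vector like the one in the example above.

```r
vec9 <- c("8^2", "B89", "y^x", "time")

# Escaped caret: the backslash itself must be doubled inside an R string
escaped <- grep("\\^", vec9, value = TRUE)

# fixed = TRUE: the pattern is matched literally, not as a regular expression
literal <- grep("^", vec9, fixed = TRUE, value = TRUE)

# Character class with a leading placeholder character, as in the text
in.class <- grep("[/^]", vec9, value = TRUE)
```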


3.11.8 Other regular expression functions 


For most epidemiologic applications, the grep function will meet our regular
expression needs. Table 3.9 summarizes other functions that use regular expressions.
Whereas the grep function enables indexing and replacing elements of a character
vector, the sub and gsub functions search for and replace single or multiple pattern
matches within text string elements of a character vector. Review the following
example:

> vec10 <- c("California", "MiSSISSIppi")
> grep("SSI", vec10) #can be used for replacement
[1] 2
> sub("SSI", replacement="ssi", vec10) #replace 1st occurrence
[1] "California"  "MissiSSIppi"
> gsub("SSI", replacement="ssi", vec10) #replace all occurrences
[1] "California"  "Mississippi"
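A common data-cleaning use of gsub is tidying white space in text fields. The following is a minimal sketch with made-up place names (raw.names is hypothetical): the first call collapses runs of white space to a single space, and the second removes a single leading or trailing space using an alternation.

```r
# Hypothetical text field with irregular spacing
raw.names <- c("  San  Francisco ", "Alameda", "Contra   Costa  ")

# Collapse one or more white space characters into a single space
squeezed <- gsub("[[:space:]]+", " ", raw.names)

# Remove the remaining space at the start or end of each string
trimmed <- gsub("^ | $", "", squeezed)
trimmed
```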








The regexpr function provides detailed information on the first pattern match
within text string elements of a character vector. It returns two integer vectors (the
second is stored as the match.length attribute). In the first vector, -1 indicates no
match, and positive integers indicate the character position where the first match
begins within a text string. In the second vector, the positive integers indicate the
match length. In contrast, the gregexpr function provides detailed information on
multiple pattern matches within text string elements of a character vector. It returns
a list where each element contains detailed information (similar to regexpr) for each
text string element of a character vector. Study the following examples:

> regexpr("SSI", vec10)
[1] -1  3
attr(,"match.length")
[1] -1  3

> gregexpr("SSI", vec10)
[[1]]
[1] -1
attr(,"match.length")
[1] -1

[[2]]
[1] 3 6
attr(,"match.length")
[1] 3 3

Problems


3.1. Using RStudio and the data from Table 3.1 on page 107, create the following
data frame:

> dat
    Status   Treatment Agegrp Freq
1     Dead Tolbutamide    <55    8
2 Survived Tolbutamide    <55   98
3     Dead     Placebo    <55    5
4 Survived     Placebo    <55  115
5     Dead Tolbutamide    55+   22
6 Survived Tolbutamide    55+   76
7     Dead     Placebo    55+   16
8 Survived     Placebo    55+   69








3.2. Select 3 to 5 classmates and collect data on first name, last name, affiliation, 
two email addresses, and today’s date. Using a text editor, create a data frame with 
this data. 


3.3. Review the United States data on AIDS cases by year available at
http://www.medepi.net/data/aids.txt. Read this data into a data frame. Graph
a calendar time series of AIDS cases.

# Hint
plot(x, y, type = "l", xlab = "x axis label", lwd = 2,
     ylab = "y axis label", main = "main title")


3.4. Review the United States data on measles cases by year available at
http://www.medepi.net/data/measles.txt. Read this data into a data frame.
Graph a calendar time series of measles cases using an arithmetic and a
semi-logarithmic scale.

# Hint
plot(x, y, type = "l", lwd = 2, xlab = "x axis label",
     ylab = "y axis label", main = "main title")
plot(x, y, type = "l", lwd = 2, xlab = "x axis label", log = "y",
     ylab = "y axis label", main = "main title")

















3.5. Review the United States data on hepatitis B cases by year available at
http://www.medepi.net/data/hepb.txt. Read this data into a data frame. Using
the R code below, plot a time series of AIDS and hepatitis B cases.

matplot(hepb$year, cbind(hepb$cases, aids$cases),
        type = "l", lwd = 2, xlab = "Year", ylab = "Cases",
        main = "Reported cases of Hepatitis B and AIDS,\nUnited States, 1980-2003")
legend(1980, 100000, legend = c("Hepatitis B", "AIDS"),
       lwd = 2, lty = 1:2, col = 1:2)

Table 3.10 Data dictionary for Evans data set

Variable  Variable name              Variable type              Possible values
id        Subject identifier         Integer
chd       Coronary heart disease     Categorical-nominal        0 = no
                                                                1 = yes
cat       Catecholamine level        Categorical-nominal        0 = normal
                                                                1 = high
age       Age                        Continuous                 years
chl       Cholesterol                Continuous                 >0
smk       Smoking status             Categorical-nominal        0 = never smoked
                                                                1 = ever smoked
ecg       Electrocardiogram          Categorical-nominal        0 = no abnormality
                                                                1 = abnormality
dbp       Diastolic blood pressure   Continuous                 mm Hg
sbp       Systolic blood pressure    Continuous                 mm Hg
hpt       High blood pressure        Categorical-nominal        0 = no
                                                                1 = yes
                                                                (dbp ≥ 95 or
                                                                sbp ≥ 160)
ch        cat × hpt                  Categorical product term
cc        cat × chl                  Continuous product term





3.6. Review data from the Evans cohort study in which 609 white males were
followed for 7 years, with coronary heart disease as the outcome of interest
(http://www.medepi.net/data/evans.txt). The data dictionary is provided in
Table 3.10.

a. Recode the binary variables (0, 1) into factors with 2 levels.
b. Discretize age into a factor with more than 2 levels.
c. Create a new hypertension categorical variable based on the current
   classification scheme¹⁷:
   Normal: SBP < 120 and DBP < 80;
   Prehypertension: SBP = [120, 140) or DBP = [80, 90);
   Hypertension-Stage 1: SBP = [140, 160) or DBP = [90, 100); and
   Hypertension-Stage 2: SBP ≥ 160 or DBP ≥ 100.
d. Using R, construct a contingency table comparing the old and new hypertension
   variables.


3.7. Review the California 2004 surveillance data on human West Nile virus cases
available at http://www.medepi.net/data/wnv/wnv2004raw.txt. Read
in the data, taking into account missing values. Convert the calendar dates into the
international standard format. Using the write.table function export the data
as an ASCII text file.

17 http://www.nhlbi.nih.gov/guidelines/hypertension/phycard.pdf

3.8. On April 19, 1940, the local health officer in the village of Lycoming,
Oswego County, New York, reported the occurrence of an outbreak of acute
gastrointestinal illness to the District Health Officer in Syracuse. Dr. A. M. Rubin,
epidemiologist-in-training, was assigned to conduct an investigation. (See
Appendix A.2 on page 184 for data dictionary.)

When Dr. Rubin arrived in the field, he learned from the health officer that all
persons known to be ill had attended a church supper held on the previous evening,
April 18. Family members who did not attend the church supper did not become
ill. Accordingly, Dr. Rubin focused the investigation on the supper. He completed
interviews with 75 of the 80 persons known to have attended, collecting information
about the occurrence and time of onset of symptoms, and foods consumed. Of the
75 persons interviewed, 46 persons reported gastrointestinal illness.

The onset of illness in all cases was acute, characterized chiefly by nausea,
vomiting, diarrhea, and abdominal pain. None of the ill persons reported having an
elevated temperature; all recovered within 24 to 30 hours. Approximately 20% of the
ill persons visited physicians. No fecal specimens were obtained for bacteriologic
examination. The investigators suspected that this was a vehicle-borne outbreak,
with food as the vehicle. Dr. Rubin put his data into a line listing.¹⁸

The supper was held in the basement of the village church. Foods were
contributed by numerous members of the congregation. The supper began at 6:00 p.m.
and continued until 11:00 p.m. Food was spread out on a table and consumed over
a period of several hours. Data regarding onset of illness and food eaten or water
drunk by each of the 75 persons interviewed are provided in the line listing. The
approximate time of eating supper was collected for only about half the persons who
had gastrointestinal illness.


a. Using RStudio, plot the cases by time of onset of illness (include appropriate
   labels and title). What does this graph tell you? (Hint: Process the text data and
   then use the hist function.)
b. Are there any cases for which the times of onset are inconsistent with the general
   experience? How might they be explained?
c. How could the data be sorted by illness status and illness onset times?
d. Where possible, calculate incubation periods and illustrate their distribution with
   an appropriate graph. Use the truehist function in the MASS package.
   Determine the mean, median, and range of the incubation period.


18 See data set at http://www.medepi.net/data/oswego/oswego.txt.

APPENDIX A

Available data sets

A.1 Latina Mothers and their Newborn

From 1980 to 1990 data were collected on 427 Latina mothers who gave birth at
the University of California, San Francisco [12, 13]. Data were collected on the
characteristics of the mothers and their newborn infants (Table A.1). Mothers were
weighed at each prenatal visit. Rate of weight gain during each trimester was based
on a linear regression interpolation. The data set can be viewed and downloaded
from http://www.medepi.net/data/birthwt9.txt.


Table A.1 Data dictionary for Latina mothers and their newborn infants

Variable  Description                          Possible values
age       Maternal age                         In years (self-reported)
parity    Parity                               Count of previous live births
gest      Gestation                            Reported in days
sex       Gender                               Male = 1, Female = 2
bwt       Birth weight                         Grams
cigs      Smoking                              Number of cigarettes per day (self-reported)
ht        Maternal height                      Measured in centimeters
wt        Maternal weight                      Pre-pregnancy weight (self-reported)
r1        Rate of weight gain (1st trimester)  Kilograms per day (estimated)
r2        Rate of weight gain (2nd trimester)  Kilograms per day (estimated)
r3        Rate of weight gain (3rd trimester)  Kilograms per day (estimated)

A.2 Oswego County (outbreak)


On April 19, 1940, the local health officer in the village of Lycoming, Oswego 
County, New York, reported the occurrence of an outbreak of acute gastrointestinal 
illness to the District Health Officer in Syracuse. Dr. A. M. Rubin, epidemiologist- 
in-training, was assigned to conduct an investigation. 

When Dr. Rubin arrived in the field, he learned from the health officer that all 
persons known to be ill had attended a church supper held on the previous evening, 
April 18. Family members who did not attend the church supper did not become 
ill. Accordingly, Dr. Rubin focused the investigation on the supper. He completed 
interviews with 75 of the 80 persons known to have attended, collecting information
about the occurrence and time of onset of symptoms, and foods consumed. Of the 
75 persons interviewed, 46 persons reported gastrointestinal illness. 

The onset of illness in all cases was acute, characterized chiefly by nausea,
vomiting, diarrhea, and abdominal pain. None of the ill persons reported having an
elevated temperature; all recovered within 24 to 30 hours. Approximately 20% of
the ill persons visited physicians. No fecal specimens were obtained for
bacteriologic examination.

The supper was held in the basement of the village church. Foods were con- 
tributed by numerous members of the congregation. The supper began at 6:00 p.m. 
and continued until 11:00 p.m. Food was spread out on a table and consumed over
a period of several hours. Data regarding onset of illness and food eaten or water 
drunk by each of the 75 persons interviewed are provided in the attached line listing 
(Oswego dataset). The approximate time of eating supper was collected for only 
about half the persons who had gastrointestinal illness. 

The data set can be viewed and downloaded from
http://www.medepi.net/data/oswego.txt. The data dictionary is provided in Table A.2.


A.3 Western Collaborative Group Study (cohort) 


The Western Collaborative Group Study (WCGS), a prospective cohort study,
recruited middle-aged men (ages 39 to 59) who were employees of 10 California
companies and collected data on 3154 individuals during the years 1960-1961. These
subjects were primarily selected to study the relationship between behavior
pattern and the risk of coronary heart disease (CHD). A number of other risk factors
were also measured to provide the best possible assessment of the CHD risk associ- 
ated with behavior type. Additional variables collected include age, height, weight, 
systolic blood pressure, diastolic blood pressure, cholesterol, smoking, and corneal 
arcus. The median follow up time was 8.05 years. 

The data set can be viewed and downloaded from
http://www.medepi.net/data/wcgs.txt. The data dictionary is provided in Table A.3.

Table A.2 Data dictionary for Oswego County data set

Variable             Possible values
id                   Subject identification number
age                  Age in years
sex                  Sex: F = Female, M = Male
meal.time            Meal time on April 18th
ill                  Developed illness: Y = Yes; N = No
onset.date           Onset date: "4/18" = April 18th, "4/19" = April 19th
onset.time           Onset time: HH:MM AM/PM
baked.ham            Consumed item: Y = Yes; N = No
spinach              Consumed item: Y = Yes; N = No
mashed.potato        Consumed item: Y = Yes; N = No
cabbage.salad        Consumed item: Y = Yes; N = No
jello                Consumed item: Y = Yes; N = No
rolls                Consumed item: Y = Yes; N = No
brown.bread          Consumed item: Y = Yes; N = No
milk                 Consumed item: Y = Yes; N = No
coffee               Consumed item: Y = Yes; N = No
water                Consumed item: Y = Yes; N = No
cakes                Consumed item: Y = Yes; N = No
vanilla.ice.cream    Consumed item: Y = Yes; N = No
chocolate.ice.cream  Consumed item: Y = Yes; N = No
fruit.salad          Consumed item: Y = Yes; N = No





A.4 Evans County (cohort) 


The Evans County data set is used to demonstrate a standard logistic regression 
(unconditional) [15]. The data are from a cohort study in which 609 white males 
were followed for 7 years, with coronary heart disease as the outcome of interest. 

The data set can be viewed and downloaded from
http://www.medepi.net/data/evans.txt. The data dictionary is provided in Table A.4.


A.5 Myocardial infarction case-control study 


The myocardial infarction (MI) data set [15] is used to demonstrate conditional 
logistic regression. The study is a case-control study that involves 117 subjects in 
39 matched strata (matched by age, race, and sex). Each stratum contains three 
subjects, one of whom is a case diagnosed with myocardial infarction and the other 
two are matched controls. 

The data set can be viewed and downloaded from
http://www.medepi.net/data/mi.txt. The data dictionary is provided in Table A.5.

Table A.3 Data dictionary for Western Collaborative Group Study data set

Variable  Variable name                 Variable type  Possible values
id        Subject ID                    Integer        2001-22101
age0      Age                           Continuous     39-59 years
height0   Height                        Continuous     60-78 in
weight0   Weight                        Continuous     78-320 lb
sbp0      Systolic blood pressure       Continuous     98-230 mm Hg
dbp0      Diastolic blood pressure      Continuous     58-150 mm Hg
chol0     Cholesterol                   Continuous     103-645 mg/100 ml
behpat0   Behavior pattern              Categorical    1 = Type A1
                                                       2 = Type A2
                                                       3 = Type B1
                                                       4 = Type B2
ncigs0    Smoking                       Integer        Cigarettes/day
dibpat0   Behavior pattern              Categorical    0 = Type B
                                                       1 = Type A
chd69     Coronary heart disease event  Categorical    0 = None
                                                       1 = Yes
typechd   Coronary heart disease event  Categorical    0 = No CHD event
                                                       1 = Symptomatic MI
                                                       2 = Silent MI
                                                       3 = Classical angina
time169   Observation (follow up) time  Continuous     18-3430 days
arcus0    Corneal arcus                 Categorical    0 = None
                                                       1 = Yes

Table A.4 Data dictionary for Evans data set

Variable  Variable name             Variable type        Possible values
id        Subject identifier        Integer
chd       Coronary heart disease    Categorical-nominal  0 = no
                                                         1 = yes
cat       Catecholamine level       Categorical-nominal  0 = normal
                                                         1 = high
age       Age                       Continuous           years
chl       Cholesterol               Continuous           > 0
smk       Smoking status            Categorical-nominal  0 = never smoked
                                                         1 = ever smoked
ecg       Electrocardiogram         Categorical-nominal  0 = no abnormality
                                                         1 = abnormality
dbp       Diastolic blood pressure  Continuous           mm Hg
sbp       Systolic blood pressure   Continuous           mm Hg
hpt       High blood pressure       Categorical-nominal  0 = no
                                                         1 = yes (dbp > 95 or sbp > 160)
ch        cat x hpt                 Categorical          product term
cc        cat x chl                 Continuous           product term

Table A.5 Data dictionary for myocardial infarction (MI) case-control data set

Variable  Variable name            Variable type        Possible values
match     Matching strata          Integer              1-39
person    Subject identifier       Integer              1-117
mi        Myocardial infarction    Categorical-nominal  0 = No
                                                        1 = Yes
smk       Smoking status           Categorical-nominal  0 = Not current smoker
                                                        1 = Current smoker
sbp       Systolic blood pressure  Categorical-ordinal  120, 140, or 160
ecg       Electrocardiogram        Categorical-nominal  0 = No abnormality
                                                        1 = Abnormality





A.6 AIDS surveillance cases 


http://www.medepi.net/data/aids.txt


A.7 Hepatitis B surveillance cases 


http://www.medepi.net/data/hepb.txt 


A.8 Measles surveillance cases 


http://www.medepi.net/data/measles.txt


A.9 University Group Diabetes Program 


http://www.medepi.net/data/ugdp.txt 


A.10 Novel influenza A (H1N1) pandemic 


A.10.1 United States reported cases and deaths as of July 23, 2009 


http://www.medepi.net/data/h1n1panflu23jul09usa.txt

APPENDIX B

Outbreak analysis template in R





We provide an analysis template using R (and the 'epitools' package). Examples
involve human West Nile virus surveillance and other data sets (AIDS, measles,
hepatitis B, etc.).


Read data 
Human West Nile virus disease surveillance, California, 2004. 


wnv <- read.table("http://www.medepi.net/data/wnv/wnv2004raw.txt",
                  sep = ",", header = TRUE, na.strings = ".")
str(wnv)   #display data set structure
head(wnv)  #display first 6 lines
edit(wnv)  #browse data frame
fix(wnv)   #browse with ability to edit (be careful!!!)

















Convert non-standard dates to Julian dates 


wnv$date.onset2 <- as.Date(wnv$date.onset, format="%m/%d/%Y")
wnv$date.tested2 <- as.Date(wnv$date.tested, format="%m/%d/%Y")


Display histogram of onset dates (epidemic curve) 


hist(wnv$date.onset2, breaks=26, freq=TRUE, col="slategray1")





Describe a continuous variable (e.g., age) 
summary(wnv$age) # no standard deviation provided
range(wnv$age, na.rm=TRUE); mean(wnv$age, na.rm=TRUE)
median(wnv$age, na.rm=TRUE); sd(wnv$age, na.rm=TRUE)








Describe continuous variable, stratified by a categorical variable 


tapply(wnv$age, wnv$sex, mean, na.rm = TRUE)
tapply(wnv$age, wnv$county, mean, na.rm = TRUE)


Display a continuous variable 


hist(wnv$age, xlab="x", ylab="y", main="title", col="skyblue")


Describe a categorical variable (e.g., sex) 
sex.tab <- xtabs(~sex, data = wnv) 
sex.dist <- prop.table(sex.tab) 
cbind(sex.tab, sex.dist) 


Display a categorical variable (e.g. sex) 


barplot(sex.tab, col="pink", ylab="Frequency", main="title")


Re-code continuous variable to categorical (e.g., age) 





wnv$age3 <- cut(wnv$age, breaks=c(0,45,65,100), right=FALSE)
age3.tab <- xtabs(~age3, data = wnv)
age3.dist <- prop.table(age3.tab)
cbind(age3.tab, age3.dist)
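
The effect of right=FALSE in cut is easiest to see on a handful of values. The following is a minimal sketch using made-up ages (not the WNV data); the labels argument and the label text are illustrative choices, not part of the template above.

```r
# Hypothetical ages, not taken from the WNV data set
age <- c(0, 12, 44, 45, 64, 65, 99, NA)

# right = FALSE gives intervals closed on the left: [0,45), [45,65), [65,100)
age3 <- cut(age, breaks = c(0, 45, 65, 100), right = FALSE)
table(age3, useNA = "ifany")

# Optional descriptive labels (one fewer label than break points)
age3.lab <- cut(age, breaks = c(0, 45, 65, 100), right = FALSE,
                labels = c("<45", "45-64", "65+"))
data.frame(age, age3, age3.lab)
```

With right=FALSE an age of exactly 45 falls in the second group and an age of exactly 65 in the third, which matches the usual way age groups are reported.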


Describe two categorical variables (e.g. sex and age) 


sexage <- xtabs(~sex + age3, data = wnv) 
sexage 

prop.table(sexage) #joint distribution
prop.table(sexage, 1) #row distribution 
prop.table(sexage, 2) #column distribution 





Applied Epidemiology Using R 14-Oct-2013 © Tomas J. Aragon (www.medepi.com) 




Plot age vs sex distribution 












































barplot(sexage, legend.text=TRUE,
        xlab="Age", ylab="Frequency", main="title")
barplot(sexage, legend.text=TRUE, beside=TRUE,
        xlab="Age", ylab="Frequency", main="title")
barplot(t(sexage), legend.text=TRUE, ylim=c(0, 650),
        xlab="Sex", ylab="Frequency", main="title")
barplot(t(sexage), legend.text=TRUE, beside=TRUE, ylim=c(0, 300),
        xlab="Sex", ylab="Frequency", main="title")








Hypothesis testing using 2-way contingency tables 
From the main menu select Packages > Install Package(s). Select a CRAN mirror
near you. Select the epitools package.


library(epitools) #load 'epitools'; only needed once per session
tab.age3 <- xtabs(~age3 + death, data = wnv)
epitab(tab.age3) #default is odds ratio
epitab(tab.age3, method = "riskratio")
prop.table(tab.age3, 1) #display row distribution (2=column)
prop.test(tab.age3[,2:1]) #remember to reverse columns
chisq.test(tab.age3) #Chi-square test
fisher.test(tab.age3) #Fisher exact test


Graphical display of epidemiologic data 
Histogram (continuous numbers or date objects) 


hist(wnv$age, xlab="x", ylab="y", main="title", col="skyblue")
hist(wnv$date.onset2, breaks=26, freq=TRUE, col="slategray1")





Bar chart (categorical variable) 


barplot(table(wnv$sex), col="skyblue", xlab="Sex", ylab="Freq",
        main="title", legend = TRUE, ylim=c(0,600))





Stacked bar chart (2 or more categorical variables) 


barplot(table(wnv$sex, wnv$age3), col=c("blue","green"),
        xlab="Sex", ylab="Freq", main="WNV Disease, Sex by Age",
        legend = TRUE, ylim=c(0, 400))

Group bar chart (2 or more categorical variables)


barplot(table(wnv$sex, wnv$age3), beside=TRUE, xlab="Sex",
        ylab="Freq", main="Sex by Age", col=c("blue","green"),
        legend = TRUE, ylim=c(0,250))





Proportion bar chart (2 or more categorical variables) 


sexage <- xtabs(~sex + age3, data = wnv)
barplot(prop.table(sexage, 2), xlab="Sex", ylab="Proportion",
        main="WNV Disease, Sex by Age", col=c("blue","green"),
        legend = TRUE, ylim=c(0,1.2))





Time series (single x values vs. single y values) 


United States measles surveillance data 


measles <- read.table("http://www.medepi.net/data/measles.txt",
                      sep="", header=TRUE)
str(measles); head(measles)
plot(measles$year, measles$cases, type="l", lwd=2, col="navy")
plot(measles$year, measles$cases, type="l", lwd=2, log="y")








Time series (multiple x values vs multiple y values) 


United States AIDS and hepatitis B surveillance data


aids <- read.table("http://www.medepi.net/data/aids.txt",
                   sep="", header=TRUE, na.strings=".")
hepb <- read.table("http://www.medepi.net/data/hepb.txt",
                   sep="", header=TRUE)
years <- cbind(aids$year, hepb$year)
cases <- cbind(aids$cases, hepb$cases)
matplot(years, cases, type="l", lwd=2, col=1:2, main="title")
legend(x=1980, y=80000, legend=c("AIDS","Hepatitis B"),
       lty=1:2, col=1:2, lwd=2)

Working with dates and times
Convert non-standard dates to standard Julian dates


dates <- c("11-02-1959","1959Nov02","November 2, 1959")
jdates <- as.Date(dates, format=c("%m-%d-%Y","%Y%b%d","%B %d, %Y"))
jdates; julian(jdates)


Converting non-standard dates and times to R date-time object:


dtim <- c("4/19/1940 12:30 AM", "4/18/1940 9:45 PM")
std.dt <- strptime(dtim, format="%m/%d/%Y %I:%M %p")
std.dt


Try ?strptime to see all format options. 
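
Once date-times are in a standard form, time intervals such as incubation periods can be obtained by subtraction. The following is a minimal sketch with made-up meal and onset times in the Oswego style; the object names and the choice of tz="UTC" are assumptions made for illustration, and the %p (AM/PM) format relies on an English-language locale.

```r
# Hypothetical meal and onset times in the Oswego line-listing style
meal  <- c("4/18/1940 7:30 PM", "4/18/1940 8:00 PM")
onset <- c("4/19/1940 12:30 AM", "4/18/1940 11:30 PM")

# as.POSIXct stores date-times as seconds, which suits arithmetic;
# tz = "UTC" is an assumption that avoids daylight-saving surprises
fmt <- "%m/%d/%Y %I:%M %p"
meal.dt  <- as.POSIXct(strptime(meal,  format = fmt, tz = "UTC"))
onset.dt <- as.POSIXct(strptime(onset, format = fmt, tz = "UTC"))

# Incubation period in hours
incub <- as.numeric(difftime(onset.dt, meal.dt, units = "hours"))
incub
```

Stating units explicitly in difftime avoids the default behavior of choosing units automatically, which can differ between data sets.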


Manually creating an epidemic curve 
Single variable 


labs <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
cases <- c(0, 25, 15, 5, 10, 20, 0)
names(cases) <- labs
barplot(cases, space=0, col="skyblue", xlab="Day", ylab="Cases",
        main="Title")


Single variable: change x-axis labels to perpendicular


xv <- barplot(cases, space=0, col="red", xlab="Day",
              ylab="Cases", main="Title", axisnames=FALSE)
axis(side=1, at=xv, labels=labs, las=2)


Stratified by second variable 


male.cases <- c(0, 15, 10, 3, 5, 5, 0)
female.cases <- c(0, 10, 5, 2, 5, 15, 0)
cases2 <- rbind(Male = male.cases, Female = female.cases)
colnames(cases2) <- labs
xv <- barplot(cases2, space=0, col=c("blue", "green"),
              xlab="Day", ylab="Cases", main="Title",
              axisnames=FALSE, legend.text=TRUE, ylim=c(0, 30))
axis(side=1, at=xv, labels=labs, las=2)

Running batch jobs


source("c:/myoutbreak/job01.R") #run program file called job01.R


Creating output log files


From within job01.R program file


x <- 1:5
y <- x*2


Sink printed objects to log file


sink("c:/temp/job.log")
print(x)
sink()


Capture output without requiring print command


capture.output(cbind(x, y),
               file="c:/temp/job.log", append=TRUE)
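
The two logging approaches can be tried without touching a fixed path. The following is a minimal sketch that writes to a temporary file and then reads the log back; the use of tempfile and the object name logfile are illustrative assumptions, not part of the template.

```r
# Write to a temporary file so the sketch runs on any system;
# the path is an illustrative stand-in for "c:/temp/job.log"
logfile <- tempfile(fileext = ".log")

x <- 1:5
y <- x*2

sink(logfile)          # start diverting printed output to the file
print(x)
sink()                 # always close the sink to restore the console

# Append a second result without an explicit print call
capture.output(cbind(x, y), file = logfile, append = TRUE)

log.lines <- readLines(logfile)
log.lines
```

Forgetting the closing sink() is a common source of confusion because console output appears to vanish; capture.output avoids that risk because it closes the file itself.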


Multivariable analysis 
Logistic regression (binomial data: cohort, case-control) 


Using WNV data with age3 variable created previously: 


mod1 <- glm(death ~ age3, family=binomial, data=wnv)
summary(mod1) #full results
exp(mod1$coef) #calculate odds ratio
mod2 <- glm(death ~ age3 + sex, family=binomial, data=wnv)
summary(mod2) #full results
exp(mod2$coef) #calculate odds ratio





























Conditional logistic regression (matched case-control) 
Here is a case-control study of myocardial infarction (Kleinbaum 2002): one case 
was matched to two controls on age, race, and sex. 


library(survival) #load survival package
chd <- read.table("http://www.medepi.net/data/chd.txt", sep=",",
                  header=TRUE)
head(chd)
chd$mi2 <- ifelse(chd$mi=="Yes", 1, 0) #re-code case status
mod1 <- clogit(mi2~smk+strata(match), data=chd)
summary(mod1)
mod2 <- clogit(mi2~smk+sbp+strata(match), data=chd)
summary(mod2)
mod3 <- clogit(mi2~smk+sbp+ecg+strata(match), data=chd)
summary(mod3)
anova(mod1, mod2, mod3, test="Chisq") #compare nested models










APPENDIX C 





Programming and creating R functions 





“Good programmers write good code, great programmers borrow good code.” 


R is a comprehensive and powerful programming language. In this section we 
briefly summarize how to use R for introductory programming, including writing 
and executing functions. 


C.1 Basic programming 


Basic epidemiologic programming in R is just a list of R expressions that are
collected and executed as a batch job. The list of R expressions is saved in an ASCII
text file with a .R extension. In RStudio, from the main menu, select File > New > R
Script. This script file will be saved with an .R extension. It can be
edited and executed within RStudio. Alternatively, we can edit this file using our
favorite text editor (e.g., GNU Emacs).

What are the characteristics of good R programming? 


• Use a good text editor (or RStudio) for programming
• Organize batch jobs into numbered sequential files (e.g., job01.R)
• Avoid graphical menu-driven approaches


First, use a good text editor (or RStudio) for programming. Each R expression 
will span one or more lines. Although one could write and submit each line at the 
R console, this approach is inefficient and not recommended. Instead, type the ex- 
pressions into your favorite text editor and save with a .R extension. Then, selected 
expressions or the whole file can be executed in R. Use the text editor that comes 
with R, or text editors customized to work with R (e.g., RStudio, Emacs with ESS). 

Second, we organize batch jobs into sequential files. Data analysis is a series
of tasks involving data entry, checking, cleaning, analysis, and reporting.
Although data analysts are primarily involved in analysis and reporting, they may
be involved in earlier phases of data preparation. Regardless of stage of
involvement, data analysts should organize, conduct, and document their analytic tasks
and batch jobs in chronological order. For example, batch jobs might be named
as follows: job01-cleaning.R, job02-recoding.R, job03-descriptive.R,
job04-logistic.R, etc. The program file name has two components: a 'job' prefix that
represents a major task and is always numbered in chronological order (job01-*.R,
job02-*.R, etc.), and a brief descriptor appended to the first component of the
file name (job01-recode-data.R, job02-bivariate-analysis.R).

If one needs to repeat parts of a previous job, add new jobs rather than editing
old ones. This way our analysis can always be reviewed, replicated, and audited
exactly in the order it was conducted. If we must edit a previous job,
then we must rerun all subsequent jobs in chronological order.

Third, we avoid graphical, menu-driven approaches. While this is a tempting ap- 
proach, our work cannot be documented, replicated, and audited. The best approach 
is to collect R expressions into batch jobs and run them using the source function. 
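
Numbered job files can themselves be run in order from a short driver script. The following is a minimal sketch under stated assumptions: the folder name, the two job files, and their contents are invented for illustration and are written to a temporary directory so that the example is self-contained.

```r
# Create two tiny job files in a temporary folder (hypothetical names)
jobdir <- file.path(tempdir(), "myoutbreak")
dir.create(jobdir, showWarnings = FALSE)
writeLines("dat <- c(3, 1, NA, 7)", file.path(jobdir, "job01-read.R"))
writeLines("dat.clean <- dat[!is.na(dat)]",
           file.path(jobdir, "job02-clean.R"))

# List job files; sort() puts zero-padded numbers in chronological order
jobs <- sort(list.files(jobdir, pattern = "^job[0-9]+.*\\.R$",
                        full.names = TRUE))

# Run each job in turn with source()
for (job in jobs) source(job)
dat.clean
```

Zero-padding the job numbers (job01, job02, ...) matters here: without it, job10 would sort before job2.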


C.2 Intermediate programming 


The next level of R programming involves (1) implementing control flow (decision 
points); (2) implementing dependencies in calculations or data manipulation; and 
(3) improving execution efficiency.


C.2.1 Control statements 


Control flow involves one or more decision points. The simplest decision point goes
like this: if a condition is TRUE, do {this} and then continue; if it is FALSE, do not 
do {this} and then continue. When R continues, the next R expression can be any 
valid expression, including another decision point. 








C.2.1.1 The if function 


We use the if function to implement single decision points. 








if (TRUE) {execute these R expressions} 


If the condition is false, R skips the bracketed expression and continues executing
subsequent lines. Study this example:


> x <- c(1, 2, NA, 4, 5)
> y <- c(1, 2, 3, 4, 5)
> if(any(is.na(x))) {x[is.na(x)] <- 999}
> x
[1]   1   2 999   4   5
> if(any(is.na(y))) {y[is.na(y)] <- 999}
> y
[1] 1 2 3 4 5


The first if condition evaluated to TRUE and the missing value was replaced. The
second if condition evaluated to FALSE and the bracketed expressions were not
evaluated.


C.2.1.2 The else function





Up to now the if condition had only one possible response. If there are two
mutually exclusive possible responses, add one else statement:


if (TRUE) {
  execute these R expressions
} else {
  execute these R expressions
}


Here is an example:


> x <- c(1, 2, NA, 4, 5)
> y <- c(1, 2, 3, 4, 5)
> if(any(is.na(x))) {
+   x[is.na(x)] <- 999; cat("NAs replaced\n")
+ } else {cat("No missing values; no replacement\n")}
NAs replaced
> if(any(is.na(y))) {
+   y[is.na(y)] <- 999; cat("NAs replaced\n")
+ } else {cat("No missing values; no replacement\n")}
No missing values; no replacement
> x
[1]   1   2 999   4   5
> y
[1] 1 2 3 4 5


Therefore, use the if and else combination if one needs to evaluate one of two
possible collections of R expressions.

If one needs to evaluate at most one of two possible collections of R expressions
(possibly neither), then use the following pattern:

if (TRUE) {
  execute these R expressions
} else if (TRUE) {
  execute these R expressions
}


The if and else functions can be combined to achieve any desired control flow.
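
Chained if, else if, and else statements can be sketched with a small helper that labels a single age value. The function name age.group and the cut points are hypothetical and chosen only for illustration; because if tests one condition at a time, the helper is meant for a single value, not a vector.

```r
# Hypothetical helper that labels a single age value
age.group <- function(age) {
  if (is.na(age)) {
    "missing"
  } else if (age < 45) {
    "<45"
  } else if (age < 65) {
    "45-64"
  } else {
    "65+"
  }
}

age.group(30)
age.group(70)
age.group(NA)
```

The conditions are checked from top to bottom and only the first one that is TRUE has its expressions evaluated, so the missing-value check is placed first.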





C.2.1.3 The “short circuit” logical operators 


The “short circuit” && and || logical operators are used for control flow in if
functions. Each operand must be a single logical value: older versions of R
silently used only the first element of a longer vector, and since R 4.3.0
supplying a vector of length greater than one is an error. Therefore, for
element-wise comparisons of 2 or more vectors, use the & and | operators, not the
&& and || operators (discussed in Chapter 2). For if function comparisons use
the && and || operators.

Suppose we want to square the elements of a numeric vector but not if it is a 
matrix. 


> x <- 1:5
> y <- matrix(1:4, 2, 2)
> if (is.numeric(x) && !is.matrix(x)) {
+   x^2
+ } else cat("Either not numeric or is a matrix\n")
[1]  1  4  9 16 25
> if (!is.matrix(y) && is.numeric(y)) {
+   y^2
+ } else cat("Either not numeric or is a matrix\n")
Either not numeric or is a matrix

















The && and || operators are called “short circuit” operators because not all of
their arguments may be evaluated: moving from left to right, only as many arguments
are evaluated as are needed to determine whether the condition is TRUE or FALSE.
This can save considerable time if some of the arguments are complex functions that
require significant computing time to evaluate to either TRUE or FALSE. In the
previous example, because !is.matrix(y) evaluates to FALSE, it was not necessary to
evaluate is.numeric(y).
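
Short-circuit evaluation can be made visible with a function that announces when it runs. The following is a minimal sketch; slow.check is a hypothetical stand-in for an expensive test.

```r
# Hypothetical stand-in for an expensive check; it reports when it runs
slow.check <- function() {
  cat("slow.check was evaluated\n")
  TRUE
}

r1 <- FALSE && slow.check()   # left side is FALSE, so slow.check is skipped
r2 <- TRUE  || slow.check()   # left side is TRUE, so slow.check is skipped
r3 <- TRUE  && slow.check()   # left side is TRUE, so slow.check must run

# A common use: guard a test that would fail on the wrong kind of input
x <- NULL
ok <- !is.null(x) && x > 0    # x > 0 is never evaluated when x is NULL
```

Only the third expression prints the message. The guard pattern in the last line is why the cheaper or safer test is usually placed on the left.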


C.2.2 Vectorized approach 


An important advantage of R is the availability of functions that perform vectorized
calculations. For example, suppose we wish to add the columns of a matrix. Here is
one approach:


> tab <- matrix(1:12, 3, 4)
> tab
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> colsum <- tab[,1]+tab[,2]+tab[,3]+tab[,4]
> colsum
[1] 22 26 30


However, this can be accomplished more efficiently using the apply function: 


> colsum2 <- apply(tab, 1, sum) 
> colsum2 
[1] 22 26 30 





In general, we want to use these types of functions (e.g., tapply, sweep,
outer, mean, etc.) because they have been optimized to perform vectorized
calculations.
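
The second argument of apply selects the margin, which is easy to mix up. The following is a minimal sketch using the same matrix; rowSums and colSums are base R functions offered here as an alternative, not as the approach used in the text.

```r
tab <- matrix(1:12, 3, 4)

# MARGIN = 1 applies sum over each row; MARGIN = 2 over each column
apply(tab, 1, sum)
apply(tab, 2, sum)

# Dedicated functions for the same totals
rowSums(tab)
colSums(tab)
```

Adding the four columns together element by element, as in the example above, produces one total per row, which is why the result has three elements and matches apply(tab, 1, sum).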


C.2.2.1 The ifelse function 


The ifelse function is a vectorized element-wise implementation of the if and
else functions. We demonstrate using the practical example of recoding a 2-level
variable.


> sex <- c("M", NA, "F", "F", NA, "M", "F", "M")
> sex2 <- ifelse(sex=="M", "Male", "Female")
> sex2
[1] "Male"   NA       "Female" "Female" NA       "Male"   "Female" "Male"


If an element of sex contains "M" (TRUE), it is recoded to "Male" (in sex2), and
otherwise (FALSE) it is recoded to "Female". Missing values remain NA. This assumes
that there are only "M"s and "F"s in the data vector.
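
When the data may contain codes other than the two expected ones, a single ifelse would silently label them "Female". The following is a minimal sketch of two ways to guard against this; the example vector with a "U" code is made up for illustration.

```r
# Hypothetical vector with an unexpected code ("U") and a missing value
sex <- c("M", "F", "U", NA, "F")

# Nested ifelse: anything that is neither "M" nor "F" becomes NA
sex2 <- ifelse(sex == "M", "Male",
               ifelse(sex == "F", "Female", NA))
sex2

# factor() with explicit levels gives the same recoding in one step
sex3 <- as.character(factor(sex, levels = c("M", "F"),
                            labels = c("Male", "Female")))
```

Running table(sex, useNA = "ifany") before recoding is a simple way to find out whether unexpected codes are present at all.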


C.2.3 Looping 


Looping is a common programming approach that is discouraged in R because it 
is inefficient. It is much better to conduct vectorized calculations using existing 
functions. For example, suppose we want to sum a numeric vector. 

> x <- 1:10
> xsum <- 0
> for (i in 1:10) {
+   xsum <- xsum + x[i]
+ }
> xsum
[1] 55


A much better approach is to use the sum function:


> sum(x)
[1] 55


Unless it is absolutely necessary, we avoid looping. 

Looping is necessary when (1) there is no R function to conduct a vectorized
calculation, or (2) the result of an element depends on the result of a preceding
element which may not be known beforehand (e.g., when it is the result of a random
process).
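
The second situation can be sketched with a toy outbreak in which each generation of cases depends on the previous generation and on how many susceptible persons remain. All numbers (R, the population size, the starting cases) are hypothetical, and the model is deliberately simplistic.

```r
# Hypothetical parameters: each case infects 1.5 others per generation,
# but new cases are capped at the number still susceptible
R <- 1.5
susceptible <- 100
cases <- numeric(6)
cases[1] <- 2                       # generation 1 starts with 2 cases
susceptible <- susceptible - cases[1]

for (g in 2:6) {
  # generation g depends on generation g - 1 and on the running total
  new <- min(floor(R * cases[g - 1]), susceptible)
  cases[g] <- new
  susceptible <- susceptible - new
}
cases
```

Because each step needs the result of the step before it, the calculation cannot be replaced by a single vectorized expression. Note that the result vector is created at its full length before the loop rather than grown inside it.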


C.2.3.1 The for function 


The previous example was a for loop. Here is the syntax: 


for (i in somevector) {
  do some calculation with ith element of somevector
}


In the for function R loops and uses the ith element of somevector either directly
or indirectly (e.g., indexing another vector). Here the vector is used directly:


> for(i in 1:3) {
+   cat(i, "\n")
+ }
1
2
3


The letters object contains the 26 lowercase letters of the English alphabet. Here
we use an integer vector for indexing letters:


> for(i in 1:3) {
+   cat(letters[i], "\n")
+ }
a
b
c


Somevector can be any vector: 


> kids <- c("Tomasito", "Luisito", "Angela") 
> for (i in kids) {print (i) } 

[1] "Tomasito" 

[1] "Luisito" 

[1] "Angela" 

C.2.3.2 The while function


The while function will continue to evaluate a collection of R expressions while a
condition is true. Here is the syntax:


while (TRUE) {
  execute these R expressions
}


Here is a trivial example:


> x <- 1; z <- 0
> while(z < 5){
+   show(z)
+   z <- z + x
+ }
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4


The while function is often used in optimization functions that are converging to a
numerical value.
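
Convergence to a numerical value can be sketched with Newton's method for a square root, a standard textbook iteration chosen here only as an illustration. The tolerance, the starting guess, and the cap of 100 iterations are arbitrary assumptions.

```r
# Newton's method for the square root of a (a hypothetical example)
a <- 2
est <- 1              # starting guess
tol <- 1e-10          # stop when successive estimates differ by less than tol
change <- Inf
iter <- 0

while (change > tol && iter < 100) {   # iter < 100 guards against endless loops
  new.est <- (est + a/est) / 2
  change <- abs(new.est - est)
  est <- new.est
  iter <- iter + 1
}
est
iter
```

The number of iterations is not known in advance, which is what distinguishes this use of while from a for loop. Including an iteration cap in the condition is a cheap safeguard in case the estimates never settle.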


C.2.3.3 The break and next functions 


The break expression will break out of a for or while loop if a condition is met,
and transfer control to the first statement outside of the innermost loop. Here is
the general syntax:


for (i in somevector) {
  do some calculation with ith element of somevector
  if (TRUE) break
}


The next expression halts the processing of the current iteration and advances
the looping index. Here is the general syntax:


for (i in somevector) {
  do some calculation with ith element of somevector
  if (TRUE) next
}


Both break and next apply only to the innermost of nested loops.
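
The difference between the two is easiest to see side by side. The following is a minimal sketch with made-up daily counts; the threshold of 20 is arbitrary.

```r
# Hypothetical daily case counts; NA marks a day with no report
counts <- c(3, NA, 5, 8, 20, 4)
total <- 0

for (i in seq_along(counts)) {
  if (is.na(counts[i])) next      # skip missing days, go to next i
  if (counts[i] >= 20) break      # stop at the first count of 20 or more
  total <- total + counts[i]
}
total
i
```

After the loop, total holds the sum of the counts seen before the loop stopped, and i holds the position at which break was triggered.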

C.2.3.4 double for


In the next example we nest two for loops to generate a multiplication table for the
integers 6 to 10:


> x <- 6:10
> mtab <- matrix(NA, 5, 5)
> rownames(mtab) <- x
> colnames(mtab) <- x
> for(i in 1:5){
+   for(j in 1:5){
+     mtab[i, j] <- x[i]*x[j]
+   }
+ }
> mtab
    6  7  8  9  10
6  36 42 48 54  60
7  42 49 56 63  70
8  48 56 64 72  80
9  54 63 72 81  90
10 60 70 80 90 100
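
In keeping with the earlier advice to prefer vectorized functions, the same table can be produced without explicit loops. The following is a minimal sketch using the base R function outer; the object name mtab2 is chosen here to avoid overwriting mtab.

```r
x <- 6:10

# outer() multiplies every element of x by every element of x
mtab2 <- outer(x, x)
dimnames(mtab2) <- list(x, x)
mtab2
```

The nested loop remains useful as a pattern when the calculation for each cell cannot be expressed as a single vectorized function.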


C.3 Writing R functions 


Writing R functions involves three steps: 


• Prepare inputs
• Do calculations
• Collect results


The best way to learn these steps is to incorporate them into our regular R pro- 
gramming. For example, suppose we are writing R code to calculate the odds ratio 
from a 2 x 2 table with the appropriate format. For this we will use the Oswego data 
set available from the epitools package. 


> ## Prepare inputs
> library(epitools)
> data(oswego)
> tab1 = xtabs(~ ill + spinach, data = oswego)
> tab1
   spinach
ill  N  Y
  N 12 17
  Y 20 26
> aa = tab1[1, 1]
> bb = tab1[1, 2]
> cc = tab1[2, 1]
> dd = tab1[2, 2]
>
> ## Do calculations
> crossprod.OR = (aa*dd) / (bb*cc)
>
> ## Collect results
> list(data = tab1, odds.ratio = crossprod.OR)
$data
   spinach
ill  N  Y
  N 12 17
  Y 20 26

$odds.ratio
[1] 0.9176471

Now that we are familiar with what it takes to calculate an odds ratio from a 2-way
table, we can convert the code into a function and load it at the R console. Here is
the new function:


myOR = function(x) {
  ## Prepare input
  ## x = 2x2 table amenable to cross-product
  aa = x[1, 1]
  bb = x[1, 2]
  cc = x[2, 1]
  dd = x[2, 2]

  ## Do calculations
  crossprod.OR = (aa*dd) / (bb*cc)

  ## Collect results
  list(data = x, odds.ratio = crossprod.OR)
}


Now we can test the function: 


> tab.test = xtabs(~ ill + spinach, data = oswego)
> myOR(tab.test)
$data
   spinach
ill  N  Y
  N 12 17
  Y 20 26

$odds.ratio
[1] 0.9176471


C.3.1 Argument default values


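Now suppose we wish to add calculating a 95% confidence interval to this function.
We will use the following normal approximation standard error formula for an odds
ratio, where a, b, c, and d are the four cell counts of the 2 x 2 table (aa, bb, cc,
and dd in the code):

\[ SE[\log(OR)] = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \]

And here is the 100(1 - \alpha)% confidence interval:

\[ OR_L, OR_U = \exp\{\log(OR) \pm Z_{\alpha/2} \, SE[\log(OR)]\} \]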
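
As a quick numerical check of these two formulas, the arithmetic can be worked by hand for the ill by spinach table shown earlier, taking the cell counts as a = 12, b = 17, c = 20, d = 26 and assuming a 95% level, for which Z is approximately 1.96. The rounding below is approximate.

```latex
\begin{align*}
OR &= \frac{a\,d}{b\,c} = \frac{12 \times 26}{17 \times 20}
    = \frac{312}{340} \approx 0.918 \\
SE[\log(OR)] &= \sqrt{\tfrac{1}{12} + \tfrac{1}{17} + \tfrac{1}{20} + \tfrac{1}{26}}
    \approx \sqrt{0.2306} \approx 0.480 \\
OR_L, OR_U &= \exp\{\log(0.918) \pm 1.96 \times 0.480\}
    \approx \exp\{-0.086 \pm 0.941\} \approx (0.358,\ 2.352)
\end{align*}
```

The interval includes 1, so these data give no indication of an association between eating spinach and illness.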


Here is the improved function: 


myOR2 = function(x, conf.level) {
  ## Prepare input
  ## x = 2x2 table amenable to cross-product
  aa = x[1, 1]
  bb = x[1, 2]
  cc = x[2, 1]
  dd = x[2, 2]
  if (missing(conf.level)) stop("Must specify confidence level")
  Z <- qnorm((1 + conf.level)/2)

  ## Do calculations
  logOR <- log((aa*dd) / (bb*cc))
  SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
  OR <- exp(logOR)
  CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)

  ## Collect results
  list(data = x, odds.ratio = OR, conf.int = CI)
}


Notice that conf.level is a new argument, but with no default value. If a user
forgets to specify a confidence level, the following line handles this possibility:


if (missing(conf.level)) stop("Must specify confidence level") 


Now we test this function: 


> tab.test = xtabs(~ ill + spinach, data = oswego)
> myOR2(tab.test)
Error in myOR2(tab.test) : Must specify confidence level
> myOR2(tab.test, 0.95)
$data
   spinach
ill  N  Y
  N 12 17
  Y 20 26

$odds.ratio
[1] 0.9176471

$conf.int
[1] 0.3580184 2.3520471


If an argument has a usual value, then specify this as an argument default value:


myOR3 = function(x, conf.level = 0.95) {
  ## Prepare input
  ## x = 2x2 table amenable to cross-product
  aa = x[1, 1]
  bb = x[1, 2]
  cc = x[2, 1]
  dd = x[2, 2]
  Z <- qnorm((1 + conf.level)/2)

  ## Do calculations
  logOR <- log((aa*dd) / (bb*cc))
  SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
  OR <- exp(logOR)
  CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)

  ## Collect results
  list(data = x, odds.ratio = OR, conf.int = CI)
}


We test our new function: 


> tab.test = xtabs(~ ill + spinach, data = oswego)
> myOR3(tab.test)
$data
   spinach
ill  N  Y
  N 12 17
  Y 20 26

$odds.ratio
[1] 0.9176471

$conf.int
[1] 0.3580184 2.3520471

> myOR3(tab.test, 0.90)
$data
   spinach
ill  N  Y
  N 12 17
  Y 20 26

$odds.ratio
[1] 0.9176471

$conf.int
[1] 0.4165094 2.0217459


C.3.2 Passing optional arguments using the ... argument


On occasion we will have a function nested inside one of our functions and we need 
to be able to pass optional arguments to this nested function. This commonly occurs 
when we write functions for customized graphics but only wish to specify some 
arguments for the nested function and leave the remaining arguments optional. For 
example, consider this function: 


myplot = function(x, y, type = "b", ...){ 
plot(x, y, type = type, ...) 
} 


When using myplot one only needs to provide the x and y arguments. The type
option has been set to a default value of "b". The ... argument will pass any optional
arguments to the nested plot function. Of course, the optional arguments must be
valid options for the plot function.
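

The same mechanism works for any nested function, not only for graphics. The sketch below is a hypothetical wrapper (mymean is not part of the text) that passes optional arguments on to mean, so the effect can be seen without opening a graphics device.

```r
## Hypothetical wrapper: round a mean, passing optional arguments on to mean()
mymean <- function(x, digits = 1, ...) {
  round(mean(x, ...), digits = digits)
}

x <- c(2, 4, 9, NA)
mymean(x)                # NA: mean() is called with its default na.rm = FALSE
mymean(x, na.rm = TRUE)  # na.rm travels through ... to mean()
```

An argument that mean does not recognize is silently ignored here, so it is worth checking the help page of the nested function before relying on the pass-through.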





C.4 Advanced topics 


C.4.1 Lexical scoping 


The variables that occur in the body of a function can be divided into three classes:
formal parameters, local variables, and free variables. The formal parameters of a
function are those occurring in the argument list of the function. Their values are
determined by the process of binding the actual function arguments to the formal
parameters. Local variables are those whose values are determined by the evaluation
of expressions in the body of the function. Variables that are not formal
parameters or local variables are called free variables. Free variables become local
variables if they are assigned to. Consider the following function definition.


f <- function(x) {
y <- 2*x
print(x)
print(y)
print(z)
}


In this function, x is a formal parameter, y is a local variable and z is a free vari- 
able. In R the free variable bindings are resolved by first looking in the environment 
in which the function was created. This is called lexical scope. If the free 
variable is not defined there, R looks in the enclosing environment. For the function 
f this would be the global environment (workspace). 

To understand the implications of lexical scope consider the following: 


> rm(list = ls())
> ls()
character(0)
> f <- function(x) {
+ y <- 2*x
+ print(x)
+ print(y)
+ print(z)
+ }
> f(5)
[1] 5
[1] 10
Error in print(z) : object 'z' not found
> z = 99
> f(5)
[1] 5
[1] 10
[1] 99


In the f function, z is a free variable. The first time f is executed, z is not defined in
the function. R looks in the enclosing environment (here, the global environment), does
not find a value for z, and reports an error. However, once an object z is created in the
global environment, R is able to find it and uses it.

Applied Epidemiology Using R 14-Oct-2013 © Tomas J. Aragon (www.medepi.com)


Lexical scoping is convenient because it allows nested functions with free variables
to run provided the variable has been defined in an enclosing environment.
This convenience becomes obvious when one writes many programs. However,
there is a danger: an unintended free variable may find an unintended value in an
enclosing environment. This may go undetected because no error is reported. This
can happen when there are many objects in the workspace from previous sessions. A
good habit is to clear the workspace of all objects at the beginning of every session.
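

A minimal sketch of this danger follows. The object and function names (rate, calc.rate) are hypothetical and chosen only for illustration: a typing error inside the function turns an intended local variable into a free variable, and a leftover object in the workspace silently supplies its value.

```r
## A leftover object from earlier work in the workspace
rate <- 100

## Hypothetical function with a typo: 'rte' was meant to be 'rate'
calc.rate <- function(cases, pop) {
  rte <- cases / pop   # assigned, but never used
  rate * 1000          # 'rate' is a free variable, found in the global environment
}

calc.rate(5, 1000)     # no error, but the result uses the leftover value of 'rate'
```

Running rm(list = ls()) before defining and calling calc.rate would turn this silent mistake into an "object 'rate' not found" error, which is much easier to notice.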

Here is another example from the R introductory manual¹. Consider a function
called cube.


cube <- function(n) {
sq <- function() n*n
n*sq()
}


The variable n in the function sq is not an argument to that function. Therefore it 
is a free variable and the scoping rules must be used to ascertain the value that is to 
be associated with it. Under static scope (S-Plus) the value is that associated with a 
global variable named n. Under lexical scope (R) it is the parameter to the function 
cube since that is the active binding for the variable n at the time the function sq 
was defined. The difference between evaluation in R and evaluation in S-Plus is that 
S-Plus looks for a global variable called n while R first looks for a variable called n 
in the environment created when cube was invoked. 


## first evaluation in S
S> cube(2)
Error in sq(): Object "n" not found
Dumped
S> n <- 3
S> cube(2)
[1] 18

## then the same function evaluated in R
R> cube(2)
[1] 8
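

The same rule explains why a function returned by another function remembers the values that were in force when it was created. The following is a small sketch with hypothetical names (make.multiplier, twice), not taken from the manual:

```r
## Hypothetical function factory
make.multiplier <- function(k) {
  function(x) k * x    # k is a free variable of the inner function
}

twice <- make.multiplier(2)
k <- 100               # a global k does not affect 'twice'
twice(5)               # uses k = 2 from the environment where 'twice' was created

## Inspect the environment that holds the binding for k
get("k", envir = environment(twice))
```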


¹ http://cran.r-project.org/doc/manuals/r-release/R-intro.pdf







Solutions 





Problems of Chapter 1 


1.1 To download R go to http://cran.r-project.org/ and follow the
instructions for your operating system. Once installed, start R. The computer file
path to the workspace file, .RData, is obtained using the getwd function (see
Table 1.3 on page 12 for more useful functions).


> getwd() 
[1] "/home/tja/Data/R/home" 


This displays the file path on the computer. To see the actual .RData file, we
must enable our computer system to view hidden files and then use the computer's
program for viewing files. This is useful to know in case we want to physically move
the file to another location.


1.2 To list the R packages currently loaded, use the search function. 





> search()
[1] ".GlobalEnv"        "package:stats"     "package:graphics"
[4] "package:grDevices" "package:utils"     "package:datasets"
[7] "package:methods"   "Autoloads"         "package:base"


Alternatively, use the searchpaths function to see the file paths. 


> searchpaths()
[1] ".GlobalEnv"                     "/usr/lib64/R/library/stats"
[3] "/usr/lib64/R/library/graphics"  "/usr/lib64/R/library/grDevices"
[5] "/usr/lib64/R/library/utils"     "/usr/lib64/R/library/datasets"
[7] "/usr/lib64/R/library/methods"   "Autoloads"
[9] "/usr/lib64/R/library/base"


1.3


ls()
rm(list=ls())


The ls function returns a character vector of object names. This character vector
can be used as the list argument in the rm function to remove all the objects.
Instead of ls we could have used the objects function.


1.4 


inches <- 1:12 
centimeters <- inches*2.54 
cbind(inches, centimeters)


1.5 


celsius <- c(0, 100) 
fahrenheit <- ((9/5)*celsius) + 32 
fahrenheit 


Notice that we used a numeric vector so that the calculation requires fewer steps.
This is an example of a vectorized (or spreadsheet-like) operation.


1.6 


celsius <- seq(0, 100, 5) 

celsius 

fahrenheit <- ((9/5)*celsius) + 32 
fahrenheit 





Notice that we used a numeric vector so that the calculation requires fewer steps.
This is an example of a vectorized (or spreadsheet-like) operation.


1.7 Suppose you weigh 150 lbs. What is your weight in kilograms? Hint: Remember
dimensional analysis?

$$150\,\text{lb} \times \frac{1\,\text{kg}}{2.2\,\text{lb}} = \frac{150}{2.2}\,\text{kg}$$

Suppose your height is 5 feet 8 inches. What is your height in meters? (5' 8" = 5 +
8/12 feet = 5.67 feet)

$$5.67\,\text{ft} \times \frac{1\,\text{m}}{3.3\,\text{ft}} = \frac{5.67}{3.3}\,\text{m}$$

mywt.lb <- 150
myht.ft <- 5 + 8/12
mywt.kg <- mywt.lb/2.2
myht.m <- myht.ft/3.3
bmi <- mywt.kg/myht.m^2
bmi


1.8 


> 7/2 #divide
[1] 3.5
> 7%/%2 #integer divide
[1] 3
> 7%%2 #modulus = remainder
[1] 1


Fig. C.1 Plot of y = log_e(x)


1.9 See Figure C.1. The number e is a very special number. When we take the logarithm
of a number using base e, we map the values $(0, +\infty)$ into $(-\infty, +\infty)$, where
$\log_e(1) = 0$. More specifically, the number range $(0, 1]$ maps into $(-\infty, 0]$, and
$[1, +\infty)$ maps into $[0, +\infty)$. Of note, $\log_e(e) = 1$. In epidemiology, disease counts
and physical measurements (e.g., weight) have the asymmetric range $[0, +\infty)$. The
natural logarithm transformation allows us to work with values between 0 and 1 on an
unbounded scale, in a sense unbounding the left tail of the distribution.


1.10 See Figure C.2. The logit transformation is a double transformation. First, the
odds transformation (R/(1 - R)) unbounds the probabilities near 1; second, the natural
logarithm of the odds, that is, the logit transformation (log(odds)), unbounds
the probabilities near 0. In other words, the logit transformation maps the numeric
range $(0, 1)$ into $(-\infty, +\infty)$, where $\log(0.5/(1 - 0.5)) = 0$. The logit
transformation allows us to work on an unbounded scale with probabilities that, of
course, have the range [0, 1].
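

A short numerical check of these two mappings may help. The input values below are arbitrary illustrations, and qlogis (the quantile function of the logistic distribution in the stats package) is used only as an independent check of the hand-computed logit.

```r
## Natural logarithm: values below 1 become negative, values above 1 positive
x <- c(0.01, 0.5, 1, exp(1), 100)
log(x)

## Logit as a double transformation: risk -> odds -> log(odds)
R <- c(0.01, 0.25, 0.5, 0.75, 0.99)
odds <- R/(1 - R)
logit <- log(odds)
cbind(R, odds, logit)

## The built-in qlogis function should give the same values
all.equal(logit, qlogis(R))
```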


Fig. C.2 The logit transformation is a double transformation. First, the odds
transformation (R/(1 - R)) unbounds the probabilities near 1; second, the logit
transformation (log(odds)) additionally unbounds the probabilities near 0. (Left panel:
risk odds = R/(1 - R) plotted against R; right panel: log(risk odds) = logit plotted
against R.)


1.11



n <- 365
per.act.risk <- c(0.5, 1, 5, 6.5, 10, 30, 50, 67)/10000
risks <- 1-(1-per.act.risk)^n
risks
##label risks (optional)
act <- c("IOI", "ROI", "IPVI", "IAI", "RPVI", "PNS", "RAI", "IDU")
names(risks) <- act
risks





1.12 For this problem I put the following code into an ASCII text file named 
job01.R: 


n <- 365
per.act.risk <- c(0.5, 1, 5, 6.5, 10, 30, 50, 67)/10000
risks <- 1-(1-per.act.risk)^n
risks
##label risks (optional)
act <- c("IOI", "ROI", "IPVI", "IAI", "RPVI", "PNS", "RAI", "IDU")
names(risks) <- act
risks





Here is what happened when I sourced it: 


> source("/home/tja/Documents/courses/ph251d/jobs/job01.R")
> source("/home/tja/Documents/courses/ph251d/jobs/ph251d-chp1-job01.R", echo = TRUE)
> n <- 365
> per.act.risk <- c(0.5, 1, 5, 6.5, 10, 30, 50, 67)/10000
> risks <- 1-(1-per.act.risk)^n
> risks
[1] 0.01808493 0.03584367 0.16685338 0.21126678 0.30593011
[6] 0.66601052 0.83951869 0.91402762
> ##label risks (optional)
> act <- c("IOI", "ROI", "IPVI", "IAI", "RPVI", "PNS", "RAI", "IDU")
> names(risks) <- act
> risks
       IOI        ROI       IPVI        IAI       RPVI        PNS
0.01808493 0.03584367 0.16685338 0.21126678 0.30593011 0.66601052
       RAI        IDU
0.83951869 0.91402762

Conclusion: running source alone runs R commands in a source file but does not 
echo the input and output to the screen unless echo = TRUE. 


1.13 


I ran this code in R (Linux):


sink("/home/tja/Documents/courses/ph251d/jobs/job01.log1a")
source("/home/tja/Documents/courses/ph251d/jobs/job01.R")
sink() #closes connection

sink("/home/tja/Documents/courses/ph251d/jobs/job01.log1b")
source("/home/tja/Documents/courses/ph251d/jobs/job01.R", echo = TRUE)
sink() #closes connection


The job01.log1a is empty. Here are the contents of job01.log1b:


> n <- 365
> per.act.risk <- c(0.5, 1, 5, 6.5, 10, 30, 50, 67)/10000
> risks <- 1-(1-per.act.risk)^n
> risks
[1] 0.01808493 0.03584367 0.16685338 0.21126678 0.30593011
[6] 0.66601052 0.83951869 0.91402762
> ##label risks (optional)
> act <- c("IOI", "ROI", "IPVI", "IAI", "RPVI", "PNS", "RAI", "IDU")
> names(risks) <- act
> risks
       IOI        ROI       IPVI        IAI       RPVI        PNS
0.01808493 0.03584367 0.16685338 0.21126678 0.30593011 0.66601052
       RAI        IDU
0.83951869 0.91402762


Conclusion: running the sink function sends what would normally go to the screen
to a log file.


1.14 Sourcing job02.R at the R command line looks like this:


> source("/home/tja/Documents/courses/ph251d/jobs/job02.R")
[1] 0.01808493 0.03584367 0.16685338 0.21126678 0.30593011 
[6] 0.66601052 0.83951869 0.91402762 


Conclusion: The source function, without echo = TRUE, will not return anything
to the screen unless the show (or print) function is used to "show" an R
object. This makes complete sense. If one is sourcing a file with thousands of R
expressions, we do not want to see all those expressions; we only want to see selected
data objects with relevant results. Sinking a file only directs anything that would
appear on the screen to a log file.
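

The behavior described in Solutions 1.12 to 1.14 can be reproduced without any files on disk beyond temporary ones. The sketch below is an illustration under stated assumptions: the job file contents and the object names (job, log.quiet, log.echo) are hypothetical and are not the job01.R or job02.R files used above.

```r
## Write a small job file to a temporary location (hypothetical contents)
job <- tempfile(fileext = ".R")
writeLines(c("risk <- 1 - (1 - 10/10000)^365",
             "risk",           # top-level expression: not shown without echo
             "print(risk)"),   # explicit print: shown even without echo
           job)

## Sink a quiet run and an echoed run to separate log files
log.quiet <- tempfile()
sink(log.quiet); source(job); sink()

log.echo <- tempfile()
sink(log.echo); source(job, echo = TRUE); sink()

## Compare what reached each log file
quiet <- readLines(log.quiet)
echoed <- readLines(log.echo)
quiet
echoed
```

The quiet log should hold only the line produced by print, while the echoed log also holds the expressions themselves.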


Problems of Chapter 2


2.1 n/a 

2.2 See Table 2.1 on page 28. 

2.3 We can index by position, by logical, and by name—if it exists. 
2.4 Any R object component(s) that can be indexed, can be replaced. 
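
The three ways of indexing named in Solution 2.3, and the replacement rule in Solution 2.4, can be sketched with a small named vector. The vector x below is a made-up example.

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

x[2]            # index by position
x[x > 25]       # index by logical vector
x["c"]          # index by name

x[x > 25] <- 0  # anything that can be indexed can be replaced
x
```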
2.5 Study and practice the following R code. 


tab <- matrix(c(139, 443, 230, 502), nrow = 2, ncol = 2,
              dimnames = list("Vital Status" = c("Dead", "Alive"),
                              Smoking = c("Yes", "No")))
tab

# equivalent
tab <- matrix(c(139, 443, 230, 502), 2, 2)
dimnames(tab) <- list("Vital Status" = c("Dead", "Alive"),
                      Smoking = c("Yes", "No"))
tab

# equivalent
tab <- matrix(c(139, 443, 230, 502), 2, 2)
rownames(tab) <- c("Dead", "Alive")
colnames(tab) <- c("Yes", "No")
names(dimnames(tab)) <- c("Vital Status", "Smoking")
tab





2.6 Using the tab object from Solution 2.5, study and practice the following R 
code to recreate Table 2.38 on page 103. 


rowt <- apply(tab, 1, sum)
tab2 <- cbind(tab, Total = rowt)
colt <- apply(tab2, 2, sum)
tab2 <- rbind(tab2, Total = colt)
names(dimnames(tab2)) <- c("Vital Status", "Smoking")
tab2


2.7 Using the tab object from Solution 2.5, study and practice the following R 
code to calculate row, column, and joint distributions. 


# row distrib
rowt <- apply(tab, 1, sum)
rowd <- sweep(tab, 1, rowt, "/")
rowd

# col distrib
colt <- apply(tab, 2, sum)
cold <- sweep(tab, 2, colt, "/")
cold

# joint distrib
jtd <- tab/sum(tab); jtd

distr <- list(row.distribution = rowd,
              col.distribution = cold,
              joint.distribution = jtd)
distr
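

As a cross-check, base R's prop.table function computes the same marginal distributions in one step. The sketch below rebuilds the table from Solution 2.5 so that it runs on its own; using prop.table here is a suggestion for verification, not the method used in the solution above.

```r
tab <- matrix(c(139, 443, 230, 502), 2, 2,
              dimnames = list("Vital Status" = c("Dead", "Alive"),
                              Smoking = c("Yes", "No")))

## Distributions computed with apply and sweep, as in the solution
rowd <- sweep(tab, 1, apply(tab, 1, sum), "/")
cold <- sweep(tab, 2, apply(tab, 2, sum), "/")
jtd <- tab/sum(tab)

## Same results from prop.table
all.equal(rowd, prop.table(tab, 1))  # row distribution
all.equal(cold, prop.table(tab, 2))  # column distribution
all.equal(jtd, prop.table(tab))      # joint distribution
```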


2.8 Using the tab2 object from Solution 2.6, study and practice the following R 
code to recreate Table 2.39 on page 103. Note that the column distributions from 
Solution 2.7 can also be used. 


risk = tab2[1,1:2]/tab2[3,1:2]
risk.ratio <- risk/risk[2]
odds <- risk/(1-risk)
odds.ratio <- odds/odds[2]
ratios <- rbind(risk, risk.ratio, odds, odds.ratio)
ratios





Interpretation: The risk of death among non-smokers is higher than the risk of death 
among smokers, suggesting that there may be some confounding. 


2.9 Implement analysis below. 


wdat = read.table("http://www.medepi.net/data/whickham-engl.txt",
                  sep = ",", header = TRUE)
str(wdat)
wdat.vas = xtabs(~Vital.Status + Age + Smoking, data = wdat)
wdat.vas
wdat.tot.vas = apply(wdat.vas, c(2, 3), sum)
wdat.risk.vas = sweep(wdat.vas, c(2, 3), wdat.tot.vas, "/")
round(wdat.risk.vas, 2)


Here are the final results: 


> round(wdat.risk.vas, 2)
, , Smoking = No

            Age
Vital.Status 18-24 25-34 35-44 45-54 55-64 65-74  75+
       Alive  0.98  0.97  0.94  0.85  0.67  0.22 0.00
       Dead   0.02  0.03  0.06  0.15  0.33  0.78 1.00

, , Smoking = Yes

            Age
Vital.Status 18-24 25-34 35-44 45-54 55-64 65-74  75+
       Alive  0.96  0.98  0.87  0.79  0.56  0.19 0.00
       Dead   0.04  0.02  0.13  0.21  0.44  0.81 1.00





Interpretation: Within age strata, the risk of death is not larger in non-smokers; in
fact, it is larger among smokers in the older age groups.


2.10 First, look at the data set at http://www.medepi.net/data/syphilis89c.txt.
Then read it in.


std <- read.table("http://www.medepi.net/data/syphilis89c.txt",
                  header = TRUE, sep = ",")
str(std)
head(std)
lapply(std, table)


Create a 3-D array without attaching the std data frame.


table(std$Race, std$Age, std$Sex)
xtabs(~ Race + Age + Sex, data = std)


Now repeat after attaching the std data frame with the attach function. Study the
differences.


attach(std)
table(Race, Age, Sex)
xtabs(~ Race + Age + Sex)
detach(std)


2.11 


tab.ars <- table(std$Age, std$Race, std$Sex)

# 2-D tables
tab.ar <- apply(tab.ars, c(1, 2), sum); tab.ar
tab.as <- apply(tab.ars, c(1, 3), sum); tab.as
tab.rs <- apply(tab.ars, c(2, 3), sum); tab.rs

# 1-D tables
tab.a <- apply(tab.ars, 1, sum); tab.a
tab.r <- apply(tab.ars, 2, sum); tab.r
tab.s <- apply(tab.ars, 3, sum); tab.s





2.12 For this example, we’ll choose one 3-D array. 





tab.ars <- table(std$Age, std$Race, std$Sex)

# row distrib
rowt <- apply(tab.ars, c(1, 3), sum)
rowd <- sweep(tab.ars, c(1, 3), rowt, "/"); rowd
#confirm
apply(rowd, c(1, 3), sum)

# col distrib
colt <- apply(tab.ars, c(2, 3), sum)
cold <- sweep(tab.ars, c(2, 3), colt, "/"); cold
#confirm
apply(cold, c(2, 3), sum)

# joint distrib
jtt <- apply(tab.ars, 3, sum)
jtd <- sweep(tab.ars, 3, jtt, "/"); jtd
#confirm
apply(jtd, 3, sum)

distr <- list(row.distribution = rowd,
              col.distribution = cold,
              joint.distribution = jtd)
distr


2.13 It is a good idea to understand how the rep function works with two vectors: 


> rep(4:6, 1:3)
[1] 4 5 5 6 6 6


We can see that the second vector determines the frequency of the first vector ele- 
ments. Now use this understanding with the syphilis data. 


sdat89b <- read.csv("http://www.medepi.net/data/syphilis89b.txt")
str(sdat89b)
Sex <- rep(sdat89b$Sex, sdat89b$Freq)
Race <- rep(sdat89b$Race, sdat89b$Freq)
Age <- rep(sdat89b$Age, sdat89b$Freq)
sdat89.df <- data.frame(Sex, Race, Age)
str(sdat89.df)
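

An alternative sketch of the same idea is to repeat the row positions rather than each field separately, and then index the data frame once. The small data frame agg below is made up for illustration and is not the syphilis data.

```r
## Hypothetical aggregate-level data: one row per covariate pattern
agg <- data.frame(Sex = c("Male", "Female"),
                  Race = c("Black", "White"),
                  Freq = c(3, 2))

## Repeat each row position according to its frequency
idx <- rep(seq_len(nrow(agg)), agg$Freq)
idx

## Index the rows once to obtain individual-level data
ind <- agg[idx, c("Sex", "Race")]
rownames(ind) <- NULL
ind

nrow(ind) == sum(agg$Freq)   # one row per individual
```

This approach keeps the fields together, which helps when the data frame has many columns.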


Problems of Chapter 3


3.1 First, we recognize that this data frame contains aggregate-level data, not
individual-level data. Each row represents a unique covariate pattern, and the last
field is the frequency of that pattern. Because the data frame has only a few rows,
here is one way:


Status <- rep(c("Dead", "Survived"), 4)
Treatment <- rep(c("Tolbutamide", "Tolbutamide",
                   "Placebo", "Placebo"), 2)
Agegrp <- c(rep("<55", 4), rep("55+", 4))
Freq <- c(8, 98, 5, 115, 22, 76, 16, 69)
dat <- data.frame(Status, Treatment, Agegrp, Freq)
dat


An alternative, and better, way is to create an array that reproduces the core data
from Table 3.1 on page 107. Then we use the data.frame and as.table functions.
Here we show a few ways to create this array object.


#answer 1
udat <- array(c(8, 98, 5, 115, 22, 76, 16, 69), dim = c(2, 2, 2),
              dimnames = list(Status = c("Dead", "Survived"),
                              Treatment = c("Tolbutamide", "Placebo"),
                              Agegrp = c("<55", "55+")))
dat <- data.frame(as.table(udat))
dat


#answer 2a
Status <- rep(c("Dead", "Survived"), 4)
Treatment <- rep(rep(c("Tolbutamide", "Placebo"),
                     c(2, 2)), 2)
Agegrp <- rep(c("<55", "55+"), c(4, 4))
Freq <- c(8, 98, 5, 115, 22, 76, 16, 69)
dat <- data.frame(Status, Treatment, Agegrp, Freq)
dat

#answer 2b, equivalent to 2a
dat <- data.frame(
  Status = rep(c("Dead", "Survived"), 4),
  Treatment = rep(rep(c("Tolbutamide", "Placebo"),
                      c(2, 2)), 2),
  Agegrp = rep(c("<55", "55+"), c(4, 4)),
  Freq = c(8, 98, 5, 115, 22, 76, 16, 69)
)
dat


3.2 See Chapter.





3.3


adat <- read.table("http://www.medepi.net/data/aids.txt", header=TRUE,
                   sep="", na.strings=".")
head(adat)
plot(adat$year, adat$cases, type = "l", xlab = "Year", lwd = 2,
     ylab = "Cases", main = "Reported AIDS Cases in United States, 1980--2003")


3.4


mdat <- read.table("http://www.medepi.net/data/measles.txt",
                   header=TRUE, sep="")
head(mdat)
plot(mdat$year, mdat$cases, type = "l", xlab = "Year", lwd = 2,
     ylab = "Cases",
     main = "Reported Measles Cases in United States, 1980--2003")
plot(mdat$year, mdat$cases, type = "l", xlab = "Year", lwd = 2,
     log = "y", ylab = "Cases",
     main = "Reported Measles Cases in United States, 1980--2003")


3.5


aids <- read.table("http://www.medepi.net/data/aids.txt", sep="",
                   header = TRUE, na.strings=".")
hepb <- read.table("http://www.medepi.net/data/hepb.txt", sep="",
                   header = TRUE)
matplot(hepb$year, cbind(hepb$cases, aids$cases),
        type = "l", lwd = 2, xlab = "Year", ylab = "Cases",
        main = "Reported cases of Hepatitis B and AIDS,\nUnited States, 1980-2003")
legend(1980, 100000, legend = c("Hepatitis B", "AIDS"),
       lwd = 2, lty = 1:2, col = 1:2)


3.6 Answer to (a):


edat <- read.table("http://www.medepi.net/data/evans.txt",
                   header = TRUE, sep="")
str(edat)
#
table(edat$chd)
edat$chd2 <- factor(edat$chd, levels = 0:1,
                    labels = c("No", "Yes"))
table(edat$chd2)
#
table(edat$cat)
edat$cat2 <- factor(edat$cat, levels = 0:1,
                    labels = c("Normal", "High"))
table(edat$cat2)
#
table(edat$smk)
edat$smk2 <- factor(edat$smk, levels = 0:1,
                    labels = c("Never", "Ever"))
table(edat$smk2)
#
table(edat$ecg)
edat$ecg2 <- factor(edat$ecg, levels = 0:1,
                    labels = c("Normal", "Abnormal"))
table(edat$ecg2)
#
table(edat$hpt)
edat$hpt2 <- factor(edat$hpt, levels = 0:1,
                    labels = c("No", "Yes"))
table(edat$hpt2)


Answer to (b):


quantile(edat$age)
edat$age4 <- cut(edat$age, quantile(edat$age),
                 right = FALSE, include.lowest = TRUE)
table(edat$age4)


Answer to (c):


hptnew <- rep(NA, nrow(edat))
normal <- edat$sbp<120 & edat$dbp<80
hptnew[normal] <- 1
prehyp <- (edat$sbp>=120 & edat$sbp<140) |
          (edat$dbp>=80 & edat$dbp<90)
hptnew[prehyp] <- 2
stage1 <- (edat$sbp>=140 & edat$sbp<160) |
          (edat$dbp>=90 & edat$dbp<100)
hptnew[stage1] <- 3
stage2 <- edat$sbp>=160 | edat$dbp>=100
hptnew[stage2] <- 4
edat$hpt4 <- factor(hptnew, levels=1:4,
    labels=c("Normal", "PreHTN", "HTN.Stage1", "HTN.Stage2"))
table(edat$hpt4)


Answer to (d):


table("Old HTN"=edat$hpt2, "New HTN"=edat$hpt4)


3.7


wdat <- read.table("http://www.medepi.net/data/wnv/wnv2004raw.txt",
                   header=TRUE, sep=",", as.is=TRUE, na.strings=c(".","Unknown"))
str(wdat)
wdat$date.onset2 <- as.Date(wdat$date.onset, format="%m/%d/%Y")
wdat$date.tested2 <- as.Date(wdat$date.tested, format="%m/%d/%Y")
write.table(wdat, "c:/temp/wnvdat.txt", sep=",", row.names=FALSE)


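Once character dates have been converted with as.Date, they can be subtracted to obtain intervals in days. The sketch below uses made-up date strings in the same month/day/year format; the vectors onset and tested are hypothetical and are not taken from the West Nile virus file.

```r
## Hypothetical character dates; "." marks a missing value
onset <- c("06/14/2004", "07/02/2004", ".")
tested <- c("06/20/2004", "07/09/2004", "07/15/2004")

## Recode the missing-value marker, then convert to Date objects
onset[onset == "."] <- NA
onset.d <- as.Date(onset, format = "%m/%d/%Y")
tested.d <- as.Date(tested, format = "%m/%d/%Y")

## Days from onset to testing
delay <- tested.d - onset.d
delay
as.numeric(delay)
```

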
3.8 See Appendix A.2 on page 184 for Oswego data dictionary. 


a. Using RStudio plot the cases by time of onset of illness (include appropriate 
labels and title). What does this graph tell you? (Hint: Process the text data and 
then use the hist function.) 


Plotting an epidemic curve with these data has special challenges because we have
dates and times to process. To do this in R, we will create date objects that contain
both the date and time for each primary event of interest: meal time and onset time
of illness. From these we can plot the distribution of onset times. An
epidemic curve is the distribution of illness onset times and can be displayed with a
histogram. First, carefully study the Oswego data set at
http://www.medepi.net/data/oswego.txt. We need to do some data preparation in order
to work with dates and times. Our initial goal is to get the date/time data into a form
that can be passed to R's strptime function for conversion into a date-time R object. To
construct the epidemic curve, study and implement the R code that follows:


odat <- read.table("http://www.medepi.net/data/oswego.txt",
                   sep = "", header = TRUE, na.strings = ".")
str(odat)
head(odat)
## Create vector with meal date and time
mdt <- paste("4/18/1940", odat$meal.time)
## Convert into standard date and time
meal.dt <- strptime(mdt, "%m/%d/%Y %I:%M %p")
## Create vector with onset date and time
odt <- paste(paste(odat$onset.date, "/1940", sep = ""), odat$onset.time)
## Convert into standard date and time
onset.dt <- strptime(odt, "%m/%d/%Y %I:%M %p")
hist(onset.dt, breaks = 30, freq = TRUE)
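

To see what the format string does, it can help to try strptime on a single pair of values first. The two date-time strings below are made up for illustration. The tz = "UTC" argument and the call to Sys.setlocale are assumptions added here so that the result does not depend on the local time zone or on how the locale spells AM and PM.

```r
## Assumption: use the C locale so that %p recognizes "AM" and "PM"
invisible(Sys.setlocale("LC_TIME", "C"))

## %m = month, %d = day, %Y = 4-digit year, %I = hour (1-12), %M = minute, %p = AM/PM
meal <- strptime("4/18/1940 7:30 PM", "%m/%d/%Y %I:%M %p", tz = "UTC")
onset <- strptime("4/19/1940 12:30 AM", "%m/%d/%Y %I:%M %p", tz = "UTC")

format(meal, "%Y-%m-%d %H:%M")
format(onset, "%Y-%m-%d %H:%M")

## Elapsed time with explicit units
difftime(onset, meal, units = "hours")
```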


b. Are there any cases for which the times of onset are inconsistent with the general 
experience? How might they be explained? 


Now that we have our data frame in R, we can identify those subjects that correspond
to the minimum and maximum onset times. We will implement R code that can be
interpreted as "which positions in vector Y correspond to the minimum values in
Y?" We then use these position numbers to index the corresponding rows in the
data frame.


##Generate logical vectors and identify 'which' position
min.obs.pos <- which(onset.dt == min(onset.dt, na.rm = TRUE))
min.obs.pos
max.obs.pos <- which(onset.dt == max(onset.dt, na.rm = TRUE))
max.obs.pos

##index data frame to display outliers
odat[min.obs.pos, ]
odat[max.obs.pos, ]


c. How could the data be sorted by illness onset times? 


We can sort the data frame based on the values of one or more fields. Suppose we want
to sort on illness status and illness onset times. We will use the onset.dt vector we
created earlier; however, we will need to convert it to "continuous time" in seconds
to sort this vector. Study and implement the R code below.


onset.ct <- as.POSIXct(onset.dt)
odat2 <- odat[order(odat$ill, onset.ct), ] 
odat2 
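

How order handles several sort keys and missing values can be seen on a small made-up data frame. The data frame d and its fields are hypothetical and are not the Oswego data.

```r
## Hypothetical data: illness status and onset hour (NA for those not ill)
d <- data.frame(id = 1:5,
                ill = c("Y", "N", "Y", "N", "Y"),
                onset.hr = c(22, NA, 3, NA, 10))

## Sort by illness status first, then by onset hour within status;
## by default, order places NA values last
d2 <- d[order(d$ill, d$onset.hr), ]
d2
```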


d. Where possible, calculate incubation periods and illustrate their distribution with 
an appropriate graph. Use the truehist function in the MASS package. De- 
termine the mean, median, and range of the incubation period. 


##Calculate incubation periods
incub.dt <- onset.dt - meal.dt
library(MASS) #load MASS package
truehist(as.numeric(incub.dt), nbins = 7, prob = FALSE,
         col = "skyblue", xlab = "Incubation Period (hours)")

##Calculate mean, median, range; remember to remove NAs
mean(incub.dt, na.rm = TRUE)
median(incub.dt, na.rm = TRUE)
range(incub.dt, na.rm = TRUE)